(57) Abstract: The present invention is based on the sequencing and assembly of the human genome. The present invention provides
the primary nucleotide sequence of the coding portion of the human genome in the form of a series of transcript sequences with
accompanying exon information. This information can be used to generate nucleic acid detection reagents and kits such as nucleic
acid arrays, and for other uses.

KITS, SUCH AS NUCLEIC ACID ARRAYS, COMPRISING A MAJORITY OF HUMAN
EXONS OR TRANSCRIPTS, FOR DETECTING EXPRESSION
AND OTHER USES THEREOF

FIELD OF THE INVENTION

The present invention is in the field of genomic discovery systems. The present
invention specifically provides the coding sequences of the human genome, including transcript
sequences and corresponding exon information, in a form that is commercially useful, including
detection kits and reagents such as nucleic acid arrays.

BACKGROUND OF THE INVENTION 

The human genome is organized into discrete expression units called genes. Genes are
further divided into exons (coding sequences) and introns (intervening, non-coding sequences).
RNA transcripts are the primary output of the genome and are generated through a process
referred to as gene expression or transcription. Gene expression involves the transcription of
DNA into pre-mRNA, followed by RNA processing of pre-mRNA into mature mRNA
transcripts, during which introns are removed and exons are spliced together to form complete
transcript sequences. However, alternative splicing pathways allow introns to be removed and
exons to be combined in different combinations, thereby allowing different mRNAs and proteins
to be produced from the same gene. It has been found that nearly 40% of human genes are
alternatively spliced (Brett et al., 2000, FEBS Lett. 474, 83). Different splice forms of genes
may play distinctly different, and important, roles in different cells/tissues, developmental stages,
or diseases and, therefore, the ability to detect different splice forms of the same gene is of
paramount importance. Alternative splicing can also act as an on-off mechanism for mRNA
activity by producing either functional or non-functional mRNAs from the same pre-mRNA.

A major goal in the development of therapeutics, diagnostic reagents, and pharmaceutical
drugs is to understand and elucidate gene expression patterns and splicing patterns, particularly
in different cells/tissues, developmental stages, and disease/pathological conditions. Determining
when or under what conditions a particular gene or splice form is expressed, in which
cells/tissues, and to what extent is important for understanding the function of the protein
encoded by the gene and its role in disease. Gene expression and splicing patterns can be
determined by reagents or kits, preferably nucleic acid arrays (also known as "DNA chips" or
"biochips"), that utilize detection elements, such as nucleic acid probes, to detect the expression
of gene fragments or the splicing together of exons to form mRNA transcripts. Such detection
elements may comprise, for example, fragments of, or complete, gene transcripts or exons,
fragments corresponding to UTR regions of the transcript, or detection elements that span the
exon/exon boundaries of a transcript. The use of exons, or exon fragments, as detection elements
has the distinct advantage of allowing the detection of different alternatively spliced transcript
forms with the same detection element. This is possible so long as the transcript form contains
the particular exon that is used as a detection element, regardless of how that exon is combined
with other exons. On the other hand, the use of complete transcripts, or transcripts comprising
more than one exon, generally allows the detection of only that particular splice form, or exon
combination, and may not detect the expression of other important transcript splice forms.
However, the use of transcripts as detection elements is advantageous in particular situations,
such as when detection of only one particular transcript, with a high degree of specificity and
minimal cross hybridization to other transcript forms, is desired.

The primary sequence of the exons/transcripts of the human genome would therefore be
valuable for use in detection kits and reagents, such as nucleic acid arrays, for detecting gene
expression patterns, including variable gene expression such as alternative splicing, and other
uses. Human exons/transcripts can serve as detection elements, such as probes, in detection kits
and reagents such as nucleic acid arrays. Not only will such kits and reagents serve as a basis for
discovery and validation of commercially important genes, they provide commercially valuable
tools for understanding the complex patterns of gene expression in relationship to different
cells/tissues, developmental stages, and disease conditions. Consequently, human
exons/transcripts, provided in a usable form, such as in the form of detection elements in a
nucleic acid array, would be valuable for disease diagnosis and treatment, such as by improving
the drug discovery and development process, or for diagnosing diseases based on aberrant gene
expression patterns.

Furthermore, a substantial proportion of current gene discovery efforts is directed at
mining EST databases. However, it has been estimated that EST databases may contain as little
as 40% of the protein-coding portion of the human genome (Aparicio, Nature Genetics, June
2000, 25: 129-130). Consequently, the primary sequence of human transcripts and exons,
identified through whole-genome sequencing, assembly, and annotation, represents the best
source for identifying protein-coding sequences of the human genome that are not represented in
EST databases. Therefore, the sequence of human exons/transcripts provided by the present
invention is useful for identifying and validating commercially valuable human genes.

Gene expression analysis, using the transcript/exon sequences provided herein, is also
useful for determining functions and relationships of genes with unknown functions. For
example, it has been shown in yeast that genes with similar functions have similar gene
expression profiles (Eisen et al. (1998) Proc. Natl. Acad. Sci. U.S.A. 95, 14863-14868).

The present invention advances the art by providing the predicted transcript sequences
(SEQ ID NOS:1-39010), for 39010 transcripts predicted from the assembled human genome,
many of which did not have evidence for their existence in the prior art. Furthermore, the present
invention provides information on each of the exons (Table 1) contained within the transcripts.
The exon information contained in Table 1 includes the coordinates of each exon within its
respective transcript, thereby allowing one to readily determine the precise boundaries of each of
the exons using the transcript coordinates and the transcript sequences as a reference. These exon
boundaries define the exon-exon junctions discussed herein. Also provided in Table 1 is
evidence supporting the existence of each exon or transcript (e.g. EST hit, mouse hit, etc.).
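
As an illustration of how exon boundaries can be recovered from transcript coordinates, the following is a minimal sketch only. It assumes that exon coordinates are 1-based and inclusive positions within the transcript; the actual coordinate convention of Table 1 is not stated here, and the function names, example sequence, and coordinates are hypothetical.

```python
from typing import List, Tuple


def exon_sequences(transcript: str, exon_coords: List[Tuple[int, int]]) -> List[str]:
    """Slice each exon out of a transcript sequence.

    Assumption (not confirmed by the source): coordinates are 1-based,
    inclusive (start, end) positions within the transcript.
    """
    exons = []
    for start, end in exon_coords:
        if not (1 <= start <= end <= len(transcript)):
            raise ValueError(f"exon coordinates out of range: {(start, end)}")
        # Convert 1-based inclusive coordinates to a Python slice.
        exons.append(transcript[start - 1:end])
    return exons


def junction_positions(exon_coords: List[Tuple[int, int]]) -> List[int]:
    """Return the 1-based position of the last base before each exon-exon junction."""
    ordered = sorted(exon_coords)
    return [end for (_, end) in ordered[:-1]]


# Hypothetical 18-nucleotide transcript made of three 6-nucleotide exons.
transcript = "ATGGCCAAGTTTGGCTAA"
coords = [(1, 6), (7, 12), (13, 18)]
exons = exon_sequences(transcript, coords)
junctions = junction_positions(coords)
```

Under these assumptions, concatenating the extracted exons in order reproduces the transcript, and each junction lies between the reported position and the next base.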

Given the transcript sequences provided by the present invention and the exon coordinate
information provided in Table 1, or fragments thereof, readily implementable compositions of
matter, such as detection elements and detection reagents/kits (e.g. in the form of probes in a
nucleic acid array), can be made using methods well known in the art and discussed herein. Such
kits and reagents can be used to track the expression and/or splicing of all of the
transcripts/genes disclosed herein, the novel members herein provided, or rationally selected
subsets thereof defined by a user.



Nucleic Acid Arrays and Detection Kits and Reagents

Oligonucleotide probes have long been used to detect complementary nucleic acid
sequences in a nucleic acid of interest (the "target" nucleic acid) in the form of detection kits and
reagents. In some assay formats, the oligonucleotide probe is tethered, i.e., by covalent
attachment, to a solid support, and arrays of oligonucleotide probes immobilized on solid
supports have been used to detect specific nucleic acid sequences in a target nucleic acid. See,
e.g., PCT patent publication Nos. WO 89/10977 and 89/11548. In other formats, the detection
reagents are supplied in solution.

The development of arraying technologies such as photolithographic synthesis of a
nucleic acid array and high density spotting of cDNA products has provided methods for making
very large arrays of oligonucleotide probes in very small areas. See U.S. Pat. No. 5,143,854 and
PCT patent publication Nos. WO 90/15070 and 92/10092. Microfabricated arrays of large
numbers of oligonucleotide probes, called "DNA chips", offer great promise for a wide variety of
applications. Such arrays may contain, for example, thousands or millions of probes. Probes may
be formed from, for example, cDNA clones, PCR products, or oligonucleotides and can be used
in solution or tethered to a support such as a planar surface (chip) or bead format.

The present invention provides detection kits and reagents, such as nucleic acid arrays,
that are based on the novel transcript/exon sequences of the human genome provided herein,
particularly the novel transcripts and novel information concerning exon structure of each
transcript provided in the Sequence Listing and in Table 1.



Medical Importance of Variable Gene Expression 

Variable gene expression, such as alternative splicing (also referred to by such terms as
alternate splicing or differential splicing) and alternative start/termination sites, is a
fundamentally important mechanism of gene regulation. Alternative splicing refers to the
formation of two or more different mature mRNA splice forms from a single gene or pre-mRNA,
depending on the combination of exons that are spliced together. Alternative splicing therefore
serves as an important means of generating additional protein diversity from the structural
information encoded by genes. Furthermore, expression of particular splice forms may differ
between, for example, cells, tissues, developmental stages/ages, populations or sexes, and may
be altered in certain disease states, such as cancer. Alternative splicing may have a detrimental
effect on intercellular interactions and the interaction of various polypeptides and cytokines and
thereby lead to diseases such as cancer.

Detection reagents, such as nucleic acid arrays and other multi-transcript detection
reagents/kits, that utilize detection elements comprised of individual transcripts or exons are
capable of detecting alternative splice forms of genes that may be missed by detection reagents
that detect only one transcript form. Detection reagents that detect disease-specific splice forms
of a gene are useful for disease diagnosis. For example, one or more detection reagents to each
exon can be used to determine if an exon is present in a sample, and/or detection reagents that
span exon/exon boundaries can be used to see if a particular exon/exon splice junction is present;
such junction-spanning reagents also select against cross reactivity with genomic DNA.

Alternative splicing plays an important role in a variety of proteins and disease pathways,
as the following examples illustrate. Elastin is a protein that is important for providing the elastic
properties of the lungs, large blood vessels, and skin. The primary elastin transcript undergoes
substantial alternative splicing, and it has been suggested that such alternative splicing of elastin
may be population-specific and contribute to aging and pathological conditions in the
cardiovascular and pulmonary systems (Indik et al., Am J Med Genet 1989 Sep;34(1):81-90).

Nitric oxide proteins are important in numerous physiological processes, such as
neurotransmission and muscle relaxation. At least six different isoforms of neuronal nitric oxide
mRNA have been identified and found to differ in enzymatic properties. Alternative splicing
provides a mechanism to generate this diversity. Furthermore, it has been observed that an
alternatively spliced form of neuronal nitric oxide that lacks exon 2 is expressed in many human
brain tumors (Brenman et al., Dev Neurosci 1997;19(3):224-31).

Alternative splicing of the amyloid precursor protein mRNA, particularly variant splicing
of exons 7 and 15, may be involved in the development of Alzheimer's disease (Beyreuther et
al., Ann N Y Acad Sci 1993 Sep 24;695:91-102).

A number of different estrogen receptor mRNA variants, many of which are generated by
alternative splicing, have been identified in breast cancer tissue and may be associated with the
development and progression of breast cancer (McGuire et al., Mol Endocrinol 1991
Nov;5(11):1571-7).

CD44 is a large family of transmembrane glycoprotein isoforms that are generated from a
single gene by alternative splicing and are involved in a variety of cancers. For example, some
CD44 isoforms have been found to be causally involved in lung metastasis formation.
Furthermore, the expression levels of particular CD44 isoforms are indicative of prognosis in
numerous cancers, such as non-Hodgkin lymphoma; gastric, colon, renal, and mammary
carcinomas; and in neuroblastomas (Gunthert et al., Cancer Surv 1995;24:19-42 and Ponta et al.,
Invasion Metastasis 1994-95;14(1-6):82-6). Therefore, detecting the expression of CD44
alternative splice forms is useful for diagnosing diseases such as these cancers.

Alternative splicing at three positions on the primary fibronectin transcript generates
multiple fibronectin polypeptide variants. Furthermore, these different fibronectin variants play
specific roles in fibronectin dimer secretion, blood clotting, adhesion to lymphoid cells, skin
wound healing, atherosclerosis, and liver fibrosis (Kornblihtt et al., FASEB J 1996
Feb;10(2):248-57).

Alternative splicing is important in the differentiation, maintenance, and function of the
red blood cell membrane. This is highlighted by the finding that hereditary hemolytic anemias
result from mutations that cause defective splicing (Benz et al., Trans Am Clin Climatol Assoc
1996;108:78-95).

Platelet derived growth factor (PDGF), which is associated with several diseases
including atherosclerosis and neoplasia, undergoes alternative splicing that could affect the
function of PDGF (Khachigian et al., Pathology 1992 Oct;24(4):280-90); consequently,
alternative splicing of PDGF may play a significant role in diseases such as atherosclerosis and
neoplasia.

As an example of the importance of alternative splicing in development, it is well known
in the art that sex-specific alternative splicing in Drosophila plays an important role in sex
determination.

Additionally, exonic splicing enhancers (ESEs) are sequence elements within exons that
promote splicing, and it has been suggested that many human diseases linked to mutations or
polymorphisms within exons may be caused by the inactivation of ESEs, thereby leading to
defective splicing (Blencowe, Trends Biochem Sci 2000 Mar;25(3):106-10).

As these examples illustrate, such fields as therapeutic/pharmaceutical drug development
and disease diagnosis/treatment would greatly benefit from detection kits and reagents that
improve the detection of variable gene expression, such as the detection of alternative splice
forms.

Using Transcripts/Exons as Detection Elements to Monitor Variable Gene Expression
The transcript sequences and the corresponding exon structure of the transcripts disclosed
herein are useful in themselves as probe/primer sequences and in the design of detection
elements, such as nucleic acid arrays or other detection kits. Transcript sequences with exon
structure are particularly useful for studying variable forms of gene expression, such as the
expression of alternative splice forms and alternative start/termination sites. As the above
examples illustrate, alternative splice forms play important roles in a variety of disease
conditions, such as cancer. The importance of detecting alternative splice forms is further
highlighted by the finding that nearly 40% of human genes are alternatively spliced (Brett et al.,
2000, FEBS Lett., 474, 83); therefore, 40% of all human genes may express alternative transcript
forms that are undetectable by conventional detection reagents that are not capable of detecting
alternative splice forms of expressed genes.

Individual exons are capable of detecting alternative splice forms that comprise that
particular exon, regardless of the combination in which that exon is spliced together with other
exons to form an alternative splice form. Therefore, one or more detection elements directed to
each single exon can be used to detect any splice form that includes that particular exon. For
example, if exon 2 of a six exon gene is used as a detection element (for example, as a probe in a
nucleic acid array), that detection element can detect the mRNA splice form of exon 2 with
exons 3 and 4, as well as the alternative mRNA splice form of exon 2 with exons 1, 5, and 6.
These two different splice forms may have distinct functional properties and one of the two
splice forms may cause a disease condition, or be diagnostic of a disease condition. Exon-based
detection elements, such as nucleic acid array probes, may be formed, for example, from exons
directly amplified from genomic DNA or synthesized using the sequences provided herein as a
reference.
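
The six exon example above can be expressed as a small sketch. It is a simplified model under stated assumptions, not a hybridization simulation: each splice form is represented only by the ordered exon numbers it contains, an exon-based detection element is taken to detect any form containing that exon, and a transcript-based detection element is taken to detect only the identical exon combination. The names `splice_forms`, `forms_detected_by_exon`, and `forms_detected_by_transcript` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def forms_detected_by_exon(exon: int,
                           splice_forms: Dict[str, Tuple[int, ...]]) -> List[str]:
    """Return every splice form that contains the given exon."""
    return [name for name, exons in splice_forms.items() if exon in exons]


def forms_detected_by_transcript(probe_exons: Sequence[int],
                                 splice_forms: Dict[str, Tuple[int, ...]]) -> List[str]:
    """Return only the splice forms whose exon combination matches the probe exactly.

    Simplifying assumption: a multi-exon transcript probe is specific to
    one exon combination.
    """
    return [name for name, exons in splice_forms.items()
            if exons == tuple(probe_exons)]


# The two splice forms from the example: exon 2 with exons 3 and 4,
# and exon 2 with exons 1, 5, and 6.
splice_forms = {
    "form_A": (2, 3, 4),
    "form_B": (1, 2, 5, 6),
}

by_exon_2 = forms_detected_by_exon(2, splice_forms)
by_exon_3 = forms_detected_by_exon(3, splice_forms)
by_transcript = forms_detected_by_transcript((2, 3, 4), splice_forms)
```

In this model the exon 2 element reports both forms, whereas an exon 3 element or a full transcript element for exons 2, 3, and 4 reports only the first form.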

Alternatively, sequences that span an exon/exon junction (see Table 2) can be used to
generate detection reagents that are useful in detecting expression and/or splice formation. Such
reagents are particularly useful in that detection signal caused by genomic contamination in the
sample is greatly reduced.

In addition to alternative splicing, variable gene expression also includes alternative start
and termination sites. As with alternative splicing, detection reagents that employ individual
exons are useful for detecting transcripts with alternative start and/or alternative termination
sites, so long as the transcript includes the exon that comprises the detection element.

Commonly used detection techniques that utilize one transcript form comprised of
multiple exons spliced together in a particular combination, such as probes formed from cDNA
libraries, are limited in that they will not detect transcripts that are comprised of exons spliced
together in a different combination, even if some of the exons are the same. Furthermore, such
detection elements may not detect transcripts that comprise alternative start and/or termination
sites. This prevents the detection of particular splice forms that may play important roles in, for
example, certain disease pathways. Therefore, in certain applications, exon sequences are
preferable to larger transcript sequences.

Accordingly, a definite need exists in the art for exons of the human genome provided in
a useful form, such as in the form of detection elements of a nucleic acid array or other detection
reagent/kit. Exons provided in such a form would be extremely valuable for detecting alternative
splice forms and other forms of variable gene expression.

Using Sequences that Span Exon-Exon Junctions as Detection Elements 
Sequences that span exon-exon junctions in a transcript are especially useful as detection
elements, such as probes in a nucleic acid array. In particular, sequences that span exon-exon
junctions eliminate false signals caused by genomic contamination. This is because a detection
element comprising two neighboring exons as one contiguous sequence will not hybridize to
genomic DNA comprising intervening intronic DNA. Such detection elements will only
hybridize to expressed mRNA transcripts in which the exons are connected and the intronic
sequence has been removed, thereby forming one contiguous stretch of sequence corresponding
to the sequence of the detection element that spans the exon-exon junction.
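
The reasoning above can be sketched in a few lines. This is an illustrative model only: exact substring matching on a single strand stands in for hybridization, the sequences are invented, and the helper name `junction_probe` and its `flank` parameter are hypothetical.

```python
def junction_probe(upstream_exon: str, downstream_exon: str, flank: int = 4) -> str:
    """Build a sequence spanning an exon-exon junction.

    Takes the last `flank` bases of the upstream exon and the first
    `flank` bases of the downstream exon as one contiguous sequence.
    """
    if flank < 1:
        raise ValueError("flank must be at least 1")
    return upstream_exon[-flank:] + downstream_exon[:flank]


# Invented example sequences (not taken from the Sequence Listing).
exon_1 = "ATGGCCAAGT"
exon_2 = "TTGGCTAACC"
intron = "GTAAGTCCCTTTCAG"

genomic = exon_1 + intron + exon_2   # intron still present
mrna = exon_1 + exon_2               # intron removed by splicing

probe = junction_probe(exon_1, exon_2, flank=4)

# Substring matching is used here as a crude stand-in for hybridization.
matches_mrna = probe in mrna
matches_genomic = probe in genomic
```

In this toy case the junction-spanning sequence is contiguous in the spliced mRNA but is interrupted by the intron in genomic DNA, so only the mRNA contains it.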




WO 02/068579 PCT/US02/00284

Exon-exon junctions are provided by the present invention and identified in Table 1.
Sequences spanning exon-exon junctions can readily be determined using the exon coordinates
provided in Table 1 along with the transcript sequences provided in the Sequence Listing. These
detection reagents alone, or in combination with intra-exon probes, can be used to elucidate the
splicing and expression pattern of genes within a variety of tissues and/or treatment protocols.

Using Transcripts as Detection Elements to Monitor Gene Expression
Transcript sequences of the human genome are also useful for monitoring gene
expression patterns and, in certain circumstances, may be preferable to individual exons for use
as detection elements for detecting gene expression. For example, the use of transcripts may be
preferred when the goal is to monitor expression of a particular transcript, or group of transcripts,
to the exclusion of all other transcripts, such as alternative splice forms. In this situation, using
transcripts as detection elements, rather than individual exons, increases specificity and
decreases undesired cross hybridization of the detection elements with alternative splice forms.

Accordingly, a definite need exists in the art for transcripts of the human genome, as well
as exons, provided in a useful form, such as in the form of detection elements in a detection
reagent/kit, such as in a nucleic acid array. Transcripts provided in such a form are useful for
monitoring particular forms of gene expression with a high degree of specificity. Such detection
elements can readily be generated using the sequence information provided herein.


SUMMARY OF THE INVENTION 

The present invention is based on the sequencing and assembly of the human genome.
The present invention provides the primary nucleotide sequence of the coding portions of the
human genome in a series of predicted transcript sequences generated from the assembled and
annotated human genome (SEQ ID NOS:1-39010). Furthermore, the position of each exon
contained within these transcripts is identified in Table 1. Individual exon sequences can readily
be determined using the transcript sequences of SEQ ID NOS:1-39010 along with the
coordinates of each exon within its respective transcript, as provided in Table 1. This
information can be used to readily generate nucleic acid detection reagents and kits, such as
nucleic acid arrays. In particular, detection reagents are provided that comprise at least one
detection element, wherein at least one detection element comprises a transcript selected from
SEQ ID NOS:1-39010. In preferred embodiments, at least one detection element of the detection
reagent comprises an exon identified in Table 1. In other preferred embodiments, the detection
reagent is a nucleic acid array and the detection elements may be, for example, probes attached
to the surface of the array. Furthermore, in other preferred embodiments, the detection reagent
comprises 10,000 or more detection elements, one or more from each of the novel
transcripts/exons disclosed herein.



Detection elements that comprise a transcript sequence or an exon, particularly an exon
selected from Table 1, allow one to identify variable forms of gene expression, such as different
splice forms of genes containing the exon of the detection element. Variable forms of gene
expression, such as alternative splicing, may have important tissue-specific, disease-specific, or
development-specific functions. Such variable forms of gene expression may go
undetected by conventional detection techniques used in gene expression studies. Detection
elements that comprise a transcript, particularly a transcript selected from SEQ ID NOS:1-39010,
allow one to monitor the expression of the transcript that comprises the detection element
with a high degree of specificity.

Furthermore, a preferred class of detection elements provided by the present invention
comprises sequences spanning exon-exon junctions. Preferred sequences span one exon-exon
junction; however, sequences may span any number of exon junctions. Sequences that span
exon-exon junctions are particularly useful in that they eliminate false signals caused by genomic
contamination. Exon-exon junctions are provided by the present invention and identified in
Table 1. Sequences spanning exon-exon junctions can readily be determined using the exon
coordinates provided in Table 1 along with the transcript sequences provided in the Sequence
Listing.

The present invention provides the nucleotide sequences of the coding portion of the
human genome, namely predicted transcript sequences and corresponding exon information, in a
form that can be used, analyzed, and commercialized for other uses in addition to detection kits
and reagents. For example, the present invention provides the nucleic acid sequences as
contiguous strings of primary sequences in a form readable by computers, such as recorded on
computer readable media, e.g., magnetic storage media, such as floppy discs, hard disc storage
medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media
such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
The present invention specifically provides a CD-R that comprises this sequence information (in
the form of a Sequence Listing, provided in file SEQLIST.TXT on the accompanying CD
labeled CL001101CDA). Such compositions are useful for, for example, virtual northern
blot analysis, BLAST searching, discovery and validation of drug targets, and comparative
genomic studies between genomes of different organisms.

The present invention further provides systems, particularly computer-based systems,
which contain the primary sequence information of the present invention stored in data storage
means. Such systems are designed to identify commercially important fragments of the human
genome.

Another embodiment of the present invention is directed to isolated fragments, and
collections of fragments, of the human genome. The fragments of the human genome include
peptide-coding fragments, such as transcripts and exons. The transcript sequences (SEQ ID
NOS:1-39010) are provided in the Sequence Listing, which is provided in file SEQLIST.TXT,
and the exon elements that each transcript is comprised of are provided in Table 1, which is
provided in file TABLE1.TXT. Both files are provided on the accompanying CD labeled
CL001101CDA.

As discussed above, the present invention includes detection reagents and kits, such as
nucleic acid arrays and microfluidic devices, that comprise one or more fragments of the human
genome of the present invention, particularly transcript sequences and/or isolated exon
sequences. The kits, such as arrays, can be used to track the expression of many exons or genes,
even all of the exons or genes, or rationally selected subsets thereof, contained in the human
genome.



The identification of the coding set of sequences from the human genome will be of great
value for a variety of commercial purposes. Many fragments of the human genome will be
immediately characterized by similarity searches against protein and nucleic acid databases and
by identifying [...] in protein domains and will be of [...] value to
researchers and for the production of proteins or to control gene expression. A specific example
concerns secreted proteins, ion channels and G-protein coupled receptors. The biological
significance of secreted proteins for controlling cell signaling, differentiation and proliferation is
well known.

Further, the development of therapeutic proteins and protein targets for human
intervention typically involves identifying a protein that can serve as a target for the
development of a small molecule modulator. Many classes of proteins are well characterized as
suitable pharmaceutical drugs (protein therapeutics or modified forms thereof) and/or drug
targets. These include, but are not limited to, secreted proteins, GPCRs and ion channels.



Brief Description of the Files Contained on the CD labeled CL001101CDA
1) File SEQLIST.TXT provides the Sequence Listing of the transcript sequences of
the present invention in text (ASCII) format. The file size is 50.7 MB.
2) File TABLE1.TXT provides Table 1, which gives detailed information on exon
structure for each of the transcript sequences in the Sequence Listing. The size of this file is 15
MB and it is stored in text (ASCII) format.



Brief Description of Table 1

Table 1 gives the results of detailed computer analysis of the human genome. Table 1
provides information on every identified human transcript and exon comprising every
gene/coding region of the human genome, as follows:

The SEQ ID NO: of each transcript sequence (corresponding to SEQ ID NOS:1-39010
provided in the file SEQLIST.TXT), a Celera UID identifying number for each
transcript, a Celera CT identifying number for each transcript, numbers corresponding to
each predicted exon contained within each transcript, predicted exon boundaries
(indicating exon-exon junctions) identified by coordinates within the corresponding
transcript, and supporting evidence for the existence of each exon and/or transcript,
where available (H = human EST/cDNA support, R = rodent EST/cDNA support, M =
mouse genomic support, and P = protein homology).
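
For readers who wish to handle this information programmatically, one way to hold a single exon entry is sketched below. The record type, field names, and example values are hypothetical; the actual column layout and identifier format of the Table 1 file are not described here, so no parsing of the file itself is attempted. Only the four evidence codes (H, R, M, P) are taken from the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Evidence codes as described for Table 1.
EVIDENCE_CODES = {
    "H": "human EST/cDNA support",
    "R": "rodent EST/cDNA support",
    "M": "mouse genomic support",
    "P": "protein homology",
}


@dataclass(frozen=True)
class ExonRecord:
    """One exon entry; all field names are illustrative assumptions."""
    seq_id_no: int           # SEQ ID NO of the parent transcript
    celera_uid: str          # Celera UID identifying number
    celera_ct: str           # Celera CT identifying number
    exon_number: int         # exon number within the transcript
    start: int               # exon start coordinate within the transcript
    end: int                 # exon end coordinate within the transcript
    evidence: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        unknown = self.evidence - set(EVIDENCE_CODES)
        if unknown:
            raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
        if self.start > self.end:
            raise ValueError("start coordinate exceeds end coordinate")

    def evidence_labels(self) -> List[str]:
        """Spell out the evidence codes in alphabetical code order."""
        return [EVIDENCE_CODES[code] for code in sorted(self.evidence)]


# Placeholder identifiers; real Celera identifiers are not reproduced here.
record = ExonRecord(1, "UID-EXAMPLE", "CT-EXAMPLE", 2, 7, 12, frozenset({"H", "P"}))
labels = record.evidence_labels()
```

Keeping the evidence codes as a set makes it straightforward to filter exons by the kind of support available, for example selecting only exons with human EST/cDNA support.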



Brief Description of the Figure 

The figure provides a block diagram of a computer system 102 that can be used to
implement the computer-based systems of the present invention.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is based on the sequencing and assembly of the human genome. In
this process, the primary nucleotide sequence of over 30 million nucleic acid fragments, from
about 400 to about 600 nucleotides in length, was determined. These fragments were assembled
using the Celera Assembler. After assembly, the sequences were analyzed with various
computer packages and compared with all external data sources. The result of this analysis was
the identification of 39010 predicted protein-coding transcripts contained in the human genome.
The present invention provides the nucleic acid sequences of these transcripts (SEQ ID
NOS:1-39010), along with corresponding exon information, in a form that can be used, for example, to
readily develop nucleic acid detection kits and reagents, such as nucleic acid arrays.

The present invention provides the nucleotide sequences of the coding sequences of the
human genome, including transcript sequences (SEQ ID NOS:1-39010) and corresponding exon
information (provided in Table 1), in a form that can be readily used, analyzed, and interpreted
by a skilled artisan. In one embodiment, the sequences are provided as contiguous strings of
primary sequence information corresponding to the nucleotide sequences provided in SEQ ID
NOS:1-39010, and/or the exons identified in Table 1; the delineated nucleotide sequence of each
exon can readily be determined using the transcript sequences of SEQ ID NOS:1-39010 along
with the coordinates of each exon within its respective transcript, as provided in Table 1. The
exon information is provided in file TABLE1.TXT and the transcript sequences are provided in
file SEQLIST.TXT; both of these files are provided on the accompanying CD labeled
CL001101CDA. The information in these files has many commercially important uses. For
example, the transcript/exon sequences and structural information provided herein can be used to
generate commercially valuable nucleic acid or peptide fragments, to design and develop
probes/primers, and to develop detection reagents and kits such as gene expression arrays.
Furthermore, the sequence and structural information provided herein is valuable for a wide
variety of commercially important computer-based biological analyses, such as virtual northern
blot analysis of gene expression, BLAST searching, or comparative genomic analysis of
different organisms. Uses such as these enable the identification and validation of commercially
important genes and gene products, as well as diagnostic kits, therapeutic agents, and drug
targets.
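
The step of delineating each exon from a transcript sequence and its exon coordinates can be
sketched as follows. This is only an illustration: the function name, the example transcript,
and the coordinate convention (1-based, inclusive start and end positions) are assumptions made
here, and the actual layout of Table 1 may differ.

```python
def exon_sequences(transcript, exon_coords):
    """Return the nucleotide sequence of each exon within a transcript.

    exon_coords is a list of (start, end) pairs. They are ASSUMED to be
    1-based and inclusive; adjust the slicing if Table 1 uses another
    convention.
    """
    exons = []
    for start, end in exon_coords:
        if not (1 <= start <= end <= len(transcript)):
            raise ValueError(f"exon ({start}, {end}) lies outside the transcript")
        # Convert 1-based inclusive coordinates to a Python slice.
        exons.append(transcript[start - 1:end])
    return exons


# Hypothetical 18-nucleotide transcript with three hypothetical exons.
transcript = "ATGGCCTTAGGCAAGTGA"
coords = [(1, 6), (7, 12), (13, 18)]
print(exon_sequences(transcript, coords))
```

Because the exons of a transcript are contiguous in the spliced sequence, joining the returned
pieces in order should reproduce the transcript, which is a convenient sanity check on the
coordinates.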

In other embodiments, the sequences of the present invention are represented by a
detection reagent/kit that is capable of identifying mRNA sequences that hybridize to any
particular exonic or transcript sequence provided herein. In particular, detection reagents and kits
are provided that comprise at least one detection element, wherein at least one detection element
comprises a transcript selected from SEQ ID NOS:1-39010, or a portion thereof. In preferred
embodiments, at least one detection element of the detection reagent comprises an exon specific
detection element identified in Table 1. In other preferred embodiments, at least one detection
element of the detection reagent spans at least one exon-exon junction; exon-exon junctions are
identified in Table 1. Furthermore, in preferred embodiments, the detection reagent/kit is a
nucleic acid array and the detection elements may be, for example, probes attached to the surface
of the array. Other preferred detection reagents/detection elements include TaqMan probe/primer
sets, for monitoring gene expression using the TaqMan 5' nuclease PCR assay. Furthermore, in
most of the preferred embodiments, the detection reagent comprises more than one
detection element (sequence) and preferably 10,000 or more such detection elements. Such
detection reagents can be used to track the expression of many genes/transcripts, or transcript
processing, even all of the transcripts/genes/exons, or rationally selected subsets thereof,
contained in the human genome.

As used herein, "detection elements" correspond to an element, such as a nucleic acid
probe, a probe/primer or a binding aptamer, that is capable of selectively binding a
transcript or exon sequence provided by the present invention, or a fragment thereof. Such
detection elements include, for example, isolated oligonucleotides comprising the transcript/exon
sequences provided herein, provided in a format such as in an array or in a TaqMan 5' nuclease
PCR assay. Detection elements, such as probes/primers, may be, for example, attached to a solid
support (e.g., in arrays) or supplied in solution (e.g., probe/primer sets for enzymatic reactions
such as PCR or RT-PCR).

Additionally, "detection elements" also include the transcript/exon sequences and/or
structural information provided herein implemented in a computer-based system. For example,
the transcript/exon sequences provided herein may be used as detection elements for searching a
computer-based database of sequence or expression information, such as for sequence similarity
searching, virtual northern blot analysis, BLAST searching, gene discovery/validation, gene
functional analysis, or comparative genomic/expression studies between different individuals,
species/organisms, or disease conditions.

Furthermore, one of the preferred classes of detection elements provided by the present
invention comprises detection elements that span an exon-exon junction in a transcript. Preferred
detection elements span one exon-exon junction. However, detection elements may span any
number of exon-exon junctions within a transcript. Detection elements that span exon-exon
junctions are particularly useful in that they eliminate false signals caused by genomic
contamination. Exon-exon junctions are identified in Table 1. Sequences spanning exon-exon
junctions can readily be determined using the exon coordinates provided in Table 1 along with
the transcript sequences provided in the Sequence Listing. Thus, references herein to exon,
transcript, or gene sequences also include sequences spanning one or more exon-exon junctions.
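
A minimal sketch of how junction-spanning sequences might be derived from exon coordinates
and a transcript sequence is given below. The function name, the `flank` parameter, and the
1-based inclusive coordinate convention are assumptions introduced for illustration only.
A probe built this way is interrupted by an intron in genomic DNA, which is why such probes
help exclude signals from genomic contamination.

```python
def junction_probes(transcript, exon_coords, flank=10):
    """Return (junction_position, sequence) pairs for each exon-exon junction.

    exon_coords: list of (start, end) pairs, ASSUMED 1-based inclusive and
    ordered 5' to 3'. The junction position is the last base of the upstream
    exon. Each sequence takes up to `flank` bases from each side of the junction.
    """
    probes = []
    for (_, end), (next_start, _) in zip(exon_coords, exon_coords[1:]):
        if next_start != end + 1:
            raise ValueError("exons must be contiguous within the spliced transcript")
        left = transcript[max(0, end - flank):end]   # 3' end of upstream exon
        right = transcript[end:end + flank]          # 5' end of downstream exon
        probes.append((end, left + right))
    return probes


# Hypothetical transcript and exon coordinates.
transcript = "ATGGCCTTAGGCAAGTGA"
coords = [(1, 6), (7, 12), (13, 18)]
for position, probe in junction_probes(transcript, coords, flank=4):
    print(position, probe)
```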
"Detection reagents" and "detection kits" refer to any system or technology platform that
utilizes detection elements comprising nucleic acid or peptide sequences/molecules/fragments
corresponding to the transcripts/exons of the present invention, as described above. Thus,
detection reagents or detection kits may refer to, for example, nucleic acid arrays (which may
also be referred to by such terms as "DNA chips", "biochips", or "microarrays"), the TaqMan 5'
nuclease PCR assay system and probe/primer sets, or other enzymatic or PCR-based assay
systems, solutions of probes and/or primers, compartmentalized kits, dot-blot or reverse dot-blot
systems, sequencing systems, microfluidic systems, mass spec systems, and various computer-
based systems such as databases of nucleic acid sequences, protein sequences, or expressed
sequences.

The term "transcript" is generally used herein to refer to coding or expressed segments of
the human genome that comprise a set of one or more exons that form a mature mRNA molecule
upon transcription/expression. The term "transcript" is also used herein to refer to the mRNA
transcript molecule, as well as the set of exons in genomic DNA that comprise the mRNA
transcript molecule. "Transcripts" may also be referred to herein as "genes", and vice versa, in
order to refer to coding portions of genes or open reading frames (ORFs) that correspond to the
transcript/exon sequences provided herein.

As used herein, a "representative fragment of the nucleotide sequence provided herein"
refers to any portion of these sequences that is not presently represented within a publicly
available database, or more particularly to a collection of fragments, where at least one of the
members of the collection is unknown, or the entire set has never been described in its entirety.

Those in the art will readily recognize that detection elements that are comprised of
nucleic acid molecules may be supplied as double stranded molecules and that reference to a
particular sequence on one strand refers, as well, to the corresponding complementary sequence
on the opposite strand. Thus reference to an adenine, a thymine (uridine), a cytosine, or a
guanine on one strand of a nucleic acid molecule is also intended to include the thymine
(uridine), adenine, guanine, or cytosine, respectively, at the corresponding sites on a
complementary strand of the nucleic acid molecule. Thus, reference may be made to either
strand in order to refer to a particular nucleic acid sequence or detection element.
Oligonucleotides, such as probes and primers, may be based on, or hybridize to, either strand.
Throughout the text, reference is generally made to the protein-coding strand, only for the
purpose of convenience.
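
The base-pairing correspondence described above (A with T or U, C with G) can be expressed
as a short helper that returns the sequence of the opposite strand. This is an illustrative
sketch; the function name is an assumption, and the output is always written as DNA.

```python
# Map each base to its complement; uridine pairs with adenine like thymine.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3', as a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGGCC"))
```

Applying the function twice to a DNA sequence returns the original sequence, reflecting that
either strand can be used to refer to the same double-stranded molecule.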

The nucleotide sequence information provided herein was obtained by sequencing the
human genome using a shotgun sequencing method known in the art. The nucleotide sequences
provided herein are a highly accurate, although not necessarily 100% perfect, representation of
the set of exonic nucleotide sequences of the human genome.

Using the information provided herein together with routine cloning and sequencing
methods, one of ordinary skill in the art is able to identify, clone and sequence all "representative
fragments" of interest, including transcripts/exons encoding a large variety of human proteins. In
very rare instances, this may reveal a nucleotide sequence error present in the nucleotide
sequence disclosed herein. Thus, once the present invention is made available (i.e., the
information in the Sequence Listing and Table 1 in a useable form), resolving a rare sequencing
error would be well within the skill of the art. Nucleotide sequence editing software is publicly
available.

Even if all of the very rare sequencing errors in the sequences herein disclosed were
corrected, the resulting nucleotide sequence would still be at least 90% identical, and more likely
99% identical, and most likely 99.99% identical to the nucleotide sequence provided herein.

Thus, the present invention further provides nucleotide sequences that are at least 90%
identical, or greater, to the nucleotide sequences of the present invention in a form that can be
readily used, analyzed and interpreted by a skilled artisan. Methods for determining whether a
nucleotide sequence is at least 90% identical to the nucleotide sequence of the present invention
are routine and readily available to a skilled artisan. For example, the well known BLAST
algorithm can be used to generate the percent identity of nucleotide sequences.
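
To make the 90% threshold concrete, the sketch below computes percent identity for two
sequences that have already been aligned. It is not the BLAST algorithm: BLAST also finds the
alignment, whereas this hypothetical helper assumes equal-length aligned inputs and counts
gap columns ("-") as mismatches.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical bases.

    Both inputs must be pre-aligned and of equal length; a column that
    contains a gap character '-' is counted as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Hypothetical reference and a variant that differs at one of ten positions.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAT")
print(identity, identity >= 90.0)
```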

The present invention also encompasses novel amino acid sequences/proteins/peptides
encoded by the transcripts/exons provided herein. Although these encoded amino acid sequences
are not explicitly given, such amino acid sequences can readily be determined using the
transcript/exon sequences and structural information provided herein in combination with the
universal genetic code. Amino acid sequences can be readily generated by numerous algorithms
or computer programs commonly used in the art that simply translate the protein-coding nucleic
acid sequences provided herein into amino acid sequences based on the universal genetic code.
Such amino acid/peptide sequences have commercially valuable uses similar to those described
herein for the transcript/exon nucleic acid sequences/fragments of the present invention, such as
design of protein detection reagents and computer-based biological analysis, for identification of
commercially important proteins.
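
A translation of this kind can be sketched in a few lines using the standard genetic code.
The function and variable names below are illustrative assumptions; the sketch assumes the
input begins in frame at the first codon and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")   # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":             # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Two hypothetical coding sequences that differ only in synonymous codons.
print(translate("ATGGCCTTAGGCAAGTGA"))
print(translate("ATGGCTCTGGGAAAATAA"))
```

Both example sequences encode the same peptide, which illustrates that different codons can
specify the same amino acid.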



Nucleic Acid Fragments 



Another embodiment of the present invention is directed to isolated fragments of the
human genome, particularly those in the form of detection elements or sets of detection
elements. The fragments of the human genome of the present invention include, but are not
limited to, fragments that encode peptides, particularly genes, exons, and transcripts identified
and described in the Sequence Listing (file SEQLIST.TXT) and in Table 1 (file TABLE1.TXT),
provided on the accompanying CD labeled CL001101CDA. Such isolated fragments of the
human genome, comprising the exon and/or transcript sequences provided herein and fragments
thereof, are particularly useful as detection elements, such as for use as probes in a nucleic acid
array, for detecting gene expression and other uses.

For example, the nucleic acid molecules/fragments of the present invention,
corresponding to the transcript/exon sequences provided herein, are useful as probes, primers,
chemical intermediates, and in biological assays for genes of the present invention, particularly
gene expression assays. The probes/primers can correspond to one or more of the exons
provided in Table 1 or one or more of the transcripts provided in the Sequence Listing, or may
span one or more exon-exon junctions identified in Table 1, or can correspond to a specific
region 5' and/or 3' to a transcript or exon provided herein. The transcript/exon sequence and
structural information provided herein are also useful for isolating or amplifying any given exon
or transcript/gene fragment of the present invention and for designing a variety of gene, or gene
expression, detection reagents/kits.

A probe/primer may comprise, for example, a substantially purified exon or transcript
molecule or an oligonucleotide or oligonucleotide pair that flanks a defined transcript/exon
sequence. A probe/primer comprising an exon or transcript molecule may comprise the full-
length exon or transcript sequence, as provided herein, or any portion thereof. A probe/primer
comprising an exon or transcript sequence may also include 5' or 3' flanking nucleic acid
sequences, depending on the particular assay. Oligonucleotide probes/primers may be shorter
molecules that comprise a nucleotide sequence that hybridizes under stringent conditions to at
least about 5, 12, 20, 25, 40, 50, 100 or more consecutive nucleotides that comprise a unique
sequence specific to the target exon or transcript/gene. Depending on the particular application,
the consecutive nucleotides can either include the target exon or transcript, or be a specific
region in close enough proximity 5' and/or 3' to the exon or transcript to carry out the desired
assay.

Furthermore, a preferred class of nucleic acid fragments are those that span exon-exon
junctions. Preferred fragments span one exon-exon junction. However, fragments may span any
number of exon-exon junctions within a transcript. Nucleic acid fragments that span exon-exon
junctions are particularly useful, when used as detection elements such as probes in an array, in
that they eliminate signals caused by genomic contamination. Exon-exon junctions are
identified in Table 1. Nucleic acid fragments spanning exon-exon junctions can readily be
determined using the exon coordinates provided in Table 1 along with the transcript sequences
provided in the Sequence Listing.

The isolated nucleic acid molecules of the present invention include, but are not limited
to, double-stranded or single-stranded DNA or RNA, such as mRNA, cDNA, or genomic DNA
comprising the exons or transcript sequences provided herein. Isolated nucleic acid molecules
may be obtained, for example, by cloning or PCR amplification, or produced by chemical
synthetic techniques or by a combination thereof. Single-stranded nucleic acid can be the coding
strand (sense strand) or the non-coding strand (anti-sense strand). Double-stranded RNA
molecules are useful for, for example, RNA interference, or gene silencing, which can be used to
turn genes off in order to elucidate their function and may be useful therapeutic agents for
turning off defective, disease-causing genes (see Plasterk et al., Curr Opin Genet Dev 2000
Oct;10(5):562-7; Bosher et al., Nat Cell Biol 2000 Feb;2(2):E31-6; and Hunter, Curr Biol 1999
Jun 17;9(12):R440-2).

"Nucleotide sequence" may refer to either a heteropolymer of deoxyribonucleotides, in
the case of DNA, or a heteropolymer of ribonucleotides, in the case of RNA. DNA or RNA
segments may be assembled, for example, from fragments of the human genome or single
nucleotides, short oligonucleotide linkers, or from a series of oligonucleotides, to provide a
synthetic nucleic acid molecule.

The present invention provides isolated nucleic acid molecules that contain one or more
exons or transcripts disclosed by the present invention. Such nucleic acid molecules will consist
of, consist essentially of, or comprise one or more exons or transcripts of the present invention.
The nucleic acid molecule can have additional nucleic acid residues, such as nucleic acid
residues that are naturally associated with it or heterologous nucleotide sequences.
As used herein, an "isolated" nucleic acid molecule is one that contains an exon and/or
transcript of the present invention and is separated from other nucleic acid present in the natural
source of the nucleic acid. The isolated nucleic acid, as used herein, will be comprised of one or
more exons and/or transcripts disclosed by the present invention. The isolated nucleic acid may
have flanking nucleotide sequence on either side of the exon or transcript depending on the
particular use of the isolated nucleic acid or assay involved. The flanking sequence may be, for
example, up to about 5,000 bases; 2,500 bases; 1,000 bases; 500 bases; 100 bases; 50 bases; 30
bases; 20 bases; or 10 bases on either side of an exon or transcript, for detection reagents. The
important point is that the nucleic acid is isolated from remote and unimportant flanking
sequences and is of appropriate length such that it can be subjected to the specific manipulations
or uses such as recombinant expression, preparation of probes and primers for expression
analysis, and other uses specific to the transcript/exon sequences.

As used herein, an "isolated nucleic acid molecule" or an "isolated fragment of the human
genome" refers to a nucleic acid molecule possessing a specific nucleotide sequence which has
been subjected to purification means to reduce, from the composition, the number of compounds
which are normally associated with the composition. A variety of purification means that are
well known in the art can be used to generate the isolated fragments of the present invention.
These include, but are not limited to, methods that separate constituents of a solution based on
charge, solubility, or size. Moreover, an "isolated" nucleic acid molecule, such as an mRNA
molecule containing a transcript sequence of the present invention or an exon isolated from
genomic DNA, can be substantially free of other cellular material, or culture medium when
produced by recombinant techniques, or chemical precursors or other chemicals when
chemically synthesized. However, the nucleic acid molecule can be fused to other coding or
regulatory sequences and still be considered isolated. For example, recombinant DNA molecules
contained in a vector are considered isolated. Further examples of isolated DNA molecules
include recombinant DNA molecules maintained in heterologous host cells or purified (partially
or substantially) DNA molecules in solution. Isolated RNA molecules include in vivo or in vitro
RNA transcripts comprising the sequences of the present invention. Isolated nucleic acid
molecules according to the present invention further include such molecules produced
synthetically.

In one embodiment, human DNA can be mechanically sheared to produce fragments of
about 2 kb, 10 kb, or 15-20 kb in length. These fragments can then be used to generate a human
library by inserting them into plasmid vectors (or lambda vectors) using methods well known in
the art. Primers flanking, for example, a gene or exon can then be generated using nucleotide
sequence information provided in the present invention. PCR cloning can then be used to isolate
the gene or exon from the human DNA library. PCR cloning is well known in the art. Thus,
given the availability of the presently identified gene coding sequences of the human genome, it is
routine experimentation to isolate any gene or exon, or fragments thereof, particularly using the
information provided in file TABLE1.TXT, provided on the accompanying CD labeled
CL001101CDA. Particularly useful is the generation of nucleic acid fragments comprising one
or more exons of a gene, particularly those identified herein. Such fragments can be applied to
an array, microfluidic device or other detection kit format and used to detect expression of a gene
(see below).

The sequences falling within the scope of the present invention are not limited to the
specific sequences herein described, but also include allelic and species variations thereof.
Allelic and species variations can be routinely determined by comparing the sequences provided
by the present invention, or a representative fragment thereof, with sequences from other isolates
from the same species (allelic variations) or from other species (species variations). Sequence
comparisons with other nucleic acid isolates to determine allelic or species variation can be
readily accomplished using the transcript/exon sequences and structural information provided
herein. For example, primers for re-sequencing any particular transcript, exon, or fragment
thereof can be readily designed based on the sequences provided herein. Such re-sequencing is
useful for detecting polymorphisms, such as SNPs, in the transcripts/exons provided herein.
Furthermore, such SNPs, being in protein coding regions, are of significant commercial value
since they may change the encoded protein sequence and thereby play a direct role in disease
development and progression. Such SNPs are important targets for therapeutic/drug
development, and may also serve as important diagnostic/prognostic markers. Thus, the
transcript/exon sequences and structural information provided herein are a commercially valuable
resource for SNP detection.

To accommodate codon variability, the present invention also encompasses nucleic acid
molecules coding for the same amino acid sequences as do the specific transcript/exon sequences
disclosed herein. In other words, in the transcript/exon sequences disclosed herein, substitution
of one codon for another that encodes the same amino acid is expressly contemplated.

The present invention further provides related nucleic acid molecules that hybridize
under stringent conditions to the nucleic acid molecules disclosed herein. As used herein, the
term "hybridizes under stringent conditions" is intended to describe conditions for hybridization
and washing under which nucleotide sequences encoding a peptide at least 60-70% homologous
to each other typically remain hybridized to each other. The conditions can be such that
sequences at least about 60%, at least about 70%, or at least about 80%, or
more homologous to each other typically remain hybridized to each other. Such stringent
conditions are known to those skilled in the art and can be found in Current Protocols in
Molecular Biology, John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6. One example of stringent
hybridization conditions is hybridization in 6X sodium chloride/sodium citrate (SSC) at about
45°C, followed by one or more washes in 0.2X SSC, 0.1% SDS at 50-65°C. Examples of
moderate to low stringency hybridization conditions are well known in the art.

The sequences provided herein can be readily screened for errors by
resequencing a particular fragment, such as an exon or transcript, in both directions (i.e.,
sequence both strands). Alternatively, error screening can be performed by sequencing
corresponding polynucleotides of human origin isolated by using part or all of the fragments in
question as a probe or primer.

Each of the transcripts/exons of the human genome, including sequences and isolated
nucleic acid molecules, can be routinely characterized using the computer system of the present
invention and can be used in numerous ways as polynucleotide reagents. For example, isolated
nucleic acid molecules comprising at least one of the exon or transcript sequences provided
herein can be used as diagnostic probes or diagnostic amplification primers to detect the
expression of a particular exon, exon set, gene, or gene set. This is particularly useful in the
form of nucleic acid arrays wherein 100 or more, 1000 or more, 5000 or more, 10,000 or more,
or even most to all of the exons/transcripts provided by the present invention are implemented in
a single array.

Nucleic Acid Arrays and Detection Kits and Reagents

The present invention provides detection kits and reagents, such as, but not limited to,
arrays, TaqMan probe/primer sets, and various compartmentalized kits, comprising detection
elements, such as nucleic acid probes, that are based on the sequence information provided by
the present invention, particularly the transcript sequences (SEQ ID NOS:1-39010) or exon
sequences (exon information is provided in Table 1).

As used herein, "Arrays" or "Microarrays" refers to an array of distinct polynucleotides or
oligonucleotides synthesized on a substrate, such as paper, nylon or other type of membrane,
filter, chip, glass slide, plastic, silicon, gold, gel or any other suitable solid or semi-solid support.
Arrays may also be based on fiber-optics and comprise, for example, probes attached to beads at
the ends of fiber-optic bundles (see Walt, Science 287, 451 (2000), Michael et al., Anal. Chem.
70, 1242-1248 (1998), and Ferguson et al., Nature Biotechnology 14, 1681-1684 (1996)). In one
embodiment, the microarray is prepared and used according to the methods described in US
Patent 5,837,832 (Chee et al.), PCT application WO95/11995 (Chee et al.), Lockhart, D. J. et al.
(1996; Nat. Biotech. 14: 1675-1680) and Schena, M. et al. (1996; Proc. Natl. Acad. Sci. 93:
10614-10619), all of which are incorporated herein in their entirety by reference. In other
embodiments, such arrays are produced by the methods described by Brown et al., US Patent
No. 5,807,522. Hybridization and scanning of arrays is also described in PCT application WO
92/10092 and EP785280. The use of microarrays of oligonucleotides or polynucleotides for
capturing complementary polynucleotides from expressed genes is also described in Schena et
al., Science, 270: 467-469 (1995); DeRisi et al., Science, 278: 680-686 (1997); Chee et al.,
Science, 274: 610-614 (1996). Additionally, Freeman et al. (Biotechniques 29, 1042-1055
(2000)), Lockhart et al. (Nature 405, 827-836 (2000)), and Zweiger (Trends in Biotechnology 17,
429-436 (1999)) provide reviews of nucleic acid arrays for gene expression analysis and other
uses; also see Nature Genetics 21 (Suppl.), 1-60 (1999) and Meldrum, Genome Research,
10:1288-1303 (2000) for an overview of array technology.

For example, gene expression kits and reagents, such as arrays or sets of probe-containing
beads, may contain one or more detection elements, such as oligonucleotide probes or pairs of
probes, that hybridize at or near each exon or gene corresponding to the exon/transcript
sequences provided by the present invention. A plurality of oligonucleotide probes may be
included in the kit to simultaneously assay large numbers of genes/exons, at least one of which is
one of the genes/exons of the present invention and is novel. In some kits,
such as arrays, the oligonucleotide probes are provided immobilized to a substrate. For example,
the same substrate can comprise oligonucleotide probes for detecting at least 1; 10; 100; 1000;
10,000 or most or substantially all of the genes/transcripts or exons provided by the present
invention. Any number of probes, or other detection elements, may be utilized in a detection
reagent, depending on the particular technology platform and objective. For example, a typical
array may contain hundreds, thousands to millions of individual synthetic DNA probes arranged
in a grid-like pattern and miniaturized to the size of a dime, each corresponding to a particular
exon or transcript/gene. Preferably, probes are attached to a solid support in an ordered,
addressable array. Customized arrays that utilize the exon and/or gene/transcript sequences
provided by the present invention can be produced by various manufacturers. For example,
arrays with over 250,000 oligonucleotide probes or 10,000 cDNAs per square centimeter are
readily available (see Lipshutz et al., Nature Genetics, 21, 20-24 (1999) and Bowtell et al.,
Nature Genetics, 21, 25-32 (1999)). In some arrays, electric fields can be applied to the array to
speed hybridization reactions (see Edman et al., Nucleic Acids Res. 25, 4907-4914 (1997) and
Sosnowski et al., Proc. Natl. Acad. Sci. USA 94, 1119-1123 (1997)). Arrays have been
previously produced for completely sequenced organisms, such as Saccharomyces cerevisiae,
that comprise probes for every identified gene in the organism's genome (see DeRisi et al.,
Science 278, 680-686 (1997) and Wodicka et al., Nature Biotechnology 15, 1359-1367 (1997)).
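
The notion of an ordered, addressable array can be illustrated with a simple mapping from
probe identifiers to grid positions. The sketch below is hypothetical: the function name, the
row-by-row fill order, and the probe identifiers are assumptions, and real array layouts are
determined by the manufacturing platform.

```python
def grid_layout(probe_ids, columns):
    """Assign each probe a (row, column) address, filling the grid row by row."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    return {probe_id: divmod(index, columns) for index, probe_id in enumerate(probe_ids)}


# Hypothetical probe identifiers laid out on a grid that is two columns wide.
layout = grid_layout(["exon_1", "exon_2", "exon_3", "exon_4", "exon_5"], columns=2)
print(layout["exon_4"])
```

Because every probe has a fixed address, a signal read at a given position can be traced
back to the exon or transcript that the probe represents.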

The microarray or detection kit is preferably composed of a large number of unique
nucleic acid sequences, usually either synthetic antisense oligonucleotides or fragments of
cDNAs, fixed to a solid support. Probes may comprise either single- or double-stranded nucleic
acid molecules. Oligonucleotides may be about 6-60 nucleotides in length, more preferably 15-
30 nucleotides in length, and most preferably about 20-25 nucleotides in length. For a certain
type of microarray or detection kit, it may be preferable to use oligonucleotides that are only 7-
20 nucleotides in length. For others, such as cDNA, longer lengths are possible and preferable.
These can be of the order of 1 kb-5 kb or more in length and can comprise the entire length of a
transcript or exon sequence provided herein or can comprise a short fragment of the
transcript/exon, such as in exon-exon junction spanning detection elements.

The microarray or detection kit may contain oligonucleotides that cover, for example,
sequential oligonucleotides that cover the full-length sequence, or unique oligonucleotides
selected from particular areas along the length of the sequence, such as in exon-exon boundaries.
Additionally, such as in the case of primers for PCR, it may be desirable for oligonucleotides to
bind to regions 5' or 3' of the transcripts/exons provided herein, such as to capture the entire
exon or transcript/gene within the amplicon. Polynucleotides used in the microarray or detection
kit may be oligonucleotides that are specific to an exon, exons, gene, or genes of interest.

WO 02/068579 PCT/US02/00284

Thus, the chip may comprise an array comprising at least one probe corresponding to the
full-length sequence of at least one of the exons and/or transcripts provided by the present
invention, sequences spanning one or more exon-exon junctions identified in Table 1, sequences
complementary thereto, or fragments thereof. Thus, the sequence of at least one probe of the
array is selected from the group consisting of those disclosed in SEQ ID NOS:1-39010 and the
exons identified in Table 1, sequences spanning one or more exon-exon junctions identified in
Table 1, sequences complementary thereto, and fragments thereof.

In order to produce oligonucleotides to a known sequence for a microarray or detection
kit, the exon(s) or gene(s) of interest is typically examined using a computer algorithm that starts
at the 5' or at the 3' end of the nucleotide sequence. Typical algorithms will then identify
oligomers of defined length that are unique to the exon/gene, have a GC content within a range
suitable for hybridization, and lack predicted secondary structure that may interfere with
hybridization. In certain situations it may be appropriate to use pairs of oligonucleotides on a
microarray or detection kit. For example, pairs of oligonucleotides are particularly useful for
detecting mismatch hybridization in high-density arrays that use short oligonucleotides, such as
25-mers; such short oligonucleotides are susceptible to mismatch hybridization due to false
priming. In this situation, pairs of oligonucleotides with deliberate mismatches are incorporated
to determine the level of mismatch hybridization, which can then be subtracted from the true
target signal (Lockhart et al., Nat. Biotechnology (1996) 14:1675-1680 and Wodicka et al., Nat.
Biotechnology (1997) 15:1359-1366). Pairs of oligonucleotide probes are also useful for
detecting polymorphisms, particularly SNPs; in these situations, the oligonucleotide pairs are
generally designed to be identical except for one nucleotide that preferably is located at or near
the center of the sequence. The second oligonucleotide in the pair (mismatched by one) serves
as a control. The oligomers are synthesized at designated areas on a substrate using a
light-directed chemical process. The substrate may be paper, nylon or other type of membrane,
filter, chip, glass slide or any other suitable solid support.
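The probe selection and mismatch-pair scheme described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the algorithm of any particular design program: the function names, the 25-nucleotide default length, and the 40-60% GC window are hypothetical choices; uniqueness is approximated by exact substring matching against the other sequences supplied; and no secondary-structure prediction is attempted.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in an upper-case DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_probes(target, others, length=25, gc_min=0.4, gc_max=0.6):
    """Scan the target from its 5' end and yield (start, probe) pairs.

    A window is kept only if its GC fraction lies in [gc_min, gc_max] and
    the window does not occur verbatim in any sequence in `others`.
    Secondary structure is NOT checked in this sketch.
    """
    target = target.upper()
    others = [other.upper() for other in others]
    for start in range(len(target) - length + 1):
        probe = target[start:start + length]
        if not gc_min <= gc_fraction(probe) <= gc_max:
            continue
        if any(probe in other for other in others):
            continue  # not unique to this exon/gene
        yield start, probe


def mismatch_partner(probe):
    """Return the probe with its central base replaced by its complement."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    mid = len(probe) // 2
    return probe[:mid] + complement[probe[mid]] + probe[mid + 1:]


def corrected_signal(perfect_match, mismatch):
    """Subtract the mismatch signal from the perfect-match signal (floor 0)."""
    return max(perfect_match - mismatch, 0.0)
```

Each probe returned by `candidate_probes` can be paired with `mismatch_partner(probe)`, and the two measured intensities combined with `corrected_signal`. Replacing the central base with its complement is one simple way to build the deliberate mismatch; other substitutions would serve equally well in this sketch.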

In another aspect, an oligonucleotide may be synthesized on the surface of the substrate
by using a chemical coupling procedure and an ink jet application apparatus, as described in PCT
application WO95/251116 (Baldeschweiler et al.) which is incorporated herein in its entirety by
reference. In another aspect, a "gridded" array analogous to a dot (or slot) blot may be used to
arrange and link cDNA fragments or oligonucleotides to the surface of a substrate using, for
example, a vacuum system, thermal, UV, mechanical or chemical bonding procedure. An array,




WO 02/068579 PCT/US02/00284

such as those described above, may be produced by hand or by using available devices (slot blot
or dot blot apparatus), materials (any suitable solid support), and machines (including robotic
instruments), and may contain 8; 24; 96; 384; 1536; 6144; 10,000 or more oligonucleotides, or
any other number which lends itself to the efficient use of commercially available
instrumentation.

In other embodiments, the array or detection reagent/kit can be produced by spotting
cDNA or other nucleic acid molecules onto the surface of a substrate (see Brown et al., US
Patent No. 5,807,522). In such use, PCR amplification of one or more exons or transcripts from
genomic DNA can be used to generate a nucleic acid molecule suitable for deposition onto a
substrate.

In yet another embodiment, the detection reagent or kit comprises TaqMan probe/primer
sets for carrying out the TaqMan PCR assay, such as for detecting gene expression. The TaqMan
assay, also known as the 5' nuclease PCR assay, provides a sensitive and rapid means of
detecting gene expression. The TaqMan assay detects the accumulation of a specific amplified
product during PCR. The TaqMan assay utilizes an oligonucleotide probe labeled with a
fluorescent reporter dye at the 5' end of the probe and a quencher dye at the 3' end of the probe.
During the PCR reaction, the 5' nuclease activity of DNA polymerase cleaves the probe, thereby
separating the reporter dye and the quencher dye and resulting in increased fluorescence of the
reporter. Accumulation of PCR product is detected directly by monitoring the increase in
fluorescence of the reporter dye. The 5' nuclease activity of DNA polymerase cleaves the probe
between the reporter and the quencher only if the probe hybridizes to the target and is amplified
during PCR. The probe is designed to hybridize to a target nucleic acid molecule only if the
target sequence is complementary to the probe, i.e., if the target sequence comprises the
transcript/exon sequence that is used as a probe.

Preferred TaqMan primer and probe sequences can readily be determined using the
nucleic acid information provided herein. A number of computer programs, such as
Primer-Express, can be used to readily obtain optimal primer/probe sets. It will be apparent to
one of skill in the art that the primers and probes based on the nucleic acid and transcript/exon
sequences and structural information provided herein are useful as probes or amplification
primers for screening for the transcripts/exons provided by the present invention, such as for
monitoring gene expression in particular disease conditions, and can be incorporated into a kit
format. In particular, genome-wide TaqMan probe/primer sets are specifically contemplated for
monitoring the expression of 10,000 or more, or most or all, human genes, or any subset thereof
of interest. Such genome-wide TaqMan probe/primer sets can readily be obtained using the
transcript sequences and transcript/exon structural information provided herein along with a
primer/probe design computer program, such as Primer-Express.

Other detection kits and reagents may be based on blotting techniques such as northern
blots (for detecting RNA), southern blots (for detecting DNA), or western blots (for detecting
proteins) or beads containing detection elements that are well known in the art. The exons and
transcript sequences provided by the present invention are well suited for use as detection probes
in such techniques.

Direct sequencing, including cDNA sequencing, can also be used to detect the transcripts
and/or exons of the present invention. A variety of automated sequencing procedures can be
utilized when performing detection/diagnostic assays ((1995) Biotechniques 19:448), including
sequencing by mass spectrometry (see, e.g., PCT International Publication No. WO 94/16101;
Cohen et al., Adv. Chromatogr. 36:127-162 (1996); and Griffin et al., Appl. Biochem.
Biotechnol. 38:147-159 (1993)).

Various other methods useful for gene expression analysis include, but are not limited to,
RT-PCR, nuclease protection, clone hybridization, differential display (Liang et al., Science 257,
967-971 (1992)), subtractive hybridization, cDNA fingerprinting (Shimkets et al., Nature
Biotechnology 17, 798-803 (1999), Ivanova, Nucleic Acids Research 23, 2954-2958 (1995),
Kato, Nucleic Acids Research 23, 3685-3690 (1995), and Bachem et al., Plant J. 9, 745-753
(1996)), reporter-gene analysis, two-dimensional (2D) gel electrophoresis, mass spectrometry,
and serial analysis of gene expression (SAGE) (Velculescu et al., Science 270, 484-487 (1995)).

In order to conduct sample analysis using a microarray or other detection reagent/kit, a
typical procedure may be similar to the following. The RNA or DNA from a biological sample is
made into hybridization probes. The mRNA is isolated, and cDNA is produced and used as a
template to make antisense RNA (aRNA). The aRNA is amplified in the presence of fluorescent
nucleotides, and labeled probes are incubated with the microarray or detection kit so that the
probe sequences hybridize to complementary oligonucleotides of the microarray or detection kit.
Incubation conditions may be adjusted so that hybridization occurs with precise complementary
matches or with various degrees of less complementarity. After removal of nonhybridized
probes, a scanner is used to determine the levels and patterns of fluorescence. The scanned
images are examined to determine degree of complementarity and the relative abundance of each
oligonucleotide sequence on the microarray or detection kit. The biological samples may be
obtained from any bodily fluids (such as blood, urine, saliva, phlegm, gastric juices, etc.),
cultured cells, biopsies, or other tissue preparations. A detection system may be used to measure
the absence, presence, and amount of hybridization for all of the distinct sequences
simultaneously. This data may be used for purposes including, but not limited to, large-scale
correlation studies on the sequences, expression patterns, mutations, variants, or polymorphisms
among samples.
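As a minimal sketch of the final step of the procedure above, scanned fluorescence intensities might be converted into relative abundances along the following lines. The probe identifiers, the single uniform background value, and the function name are assumptions made purely for illustration; real scanners and analysis packages apply more elaborate, spot-specific corrections.

```python
def relative_abundance(intensities, background=0.0):
    """Convert raw spot intensities into fractions of the total signal.

    `intensities` maps a probe identifier to its scanned fluorescence.
    A single background value is subtracted from every spot (an assumed
    simplification), negative results are clipped to zero, and the
    corrected values are normalized so that they sum to 1.
    """
    corrected = {probe: max(value - background, 0.0)
                 for probe, value in intensities.items()}
    total = sum(corrected.values())
    if total == 0:
        return {probe: 0.0 for probe in corrected}
    return {probe: value / total for probe, value in corrected.items()}
```

A probe whose corrected intensity is zero can be read as "absent" in this simplified scheme, while the normalized fractions give the relative amount of hybridization across the array.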

Using such arrays, the present invention provides methods to identify the expression of
one or more of the exons or transcripts/genes of the present invention. Such methods may
comprise incubating a test sample with an array comprising one or more oligonucleotide probes
corresponding to at least one exon or transcript of the present invention and assaying for binding
of a nucleic acid from the test sample with one or more of the oligonucleotide probes. Such
assays will typically involve arrays comprising most, if not all, of the exons or transcripts in the
human genome, or rationally selected subsets thereof. The transcript sequences of the human
genome are provided in SEQ ID NOS:1-39010 and the exons that these transcripts are comprised
of are provided in Table 1.

Conditions for incubating a nucleic acid molecule with a test sample vary. Incubation
conditions depend on the format employed in the assay, the detection methods employed, and the
type of the nucleic acid molecule used in the assay. One skilled in the art will
recognize that any one of the commonly available hybridization, amplification, or array assay
formats can readily be adapted to employ the novel fragments of the human genome disclosed
herein. Examples of such assays can be found in Chard, T., An Introduction to
Radioimmunoassay and Related Techniques, Elsevier Science Publishers, Amsterdam, The
Netherlands (1986); Bullock, G. R. et al., Techniques in Immunocytochemistry, Academic
Press, Orlando, FL Vol. 1 (1982), Vol. 2 (1983), Vol. 3 (1985); Tijssen, P., Practice and
Theory of Enzyme Immunoassays: Laboratory Techniques in Biochemistry and Molecular
Biology, Elsevier Science Publishers, Amsterdam, The Netherlands (1985).



The test samples of the present invention include, but are not limited to, nucleic acid
extracts, cells, and protein or membrane extracts from cells, which may be obtained from any
bodily fluids (such as blood, urine, saliva, phlegm, gastric juices, etc.), cultured cells, biopsies, or
other tissue preparations. The test sample used in the above-described methods will vary based
on the assay format, nature of the detection method and the tissues, cells or extracts used as the
sample to be assayed. Methods of preparing nucleic acid, protein, or cell extracts are well
known in the art and can readily be adapted in order to obtain a sample that is compatible with
the system utilized.

In another embodiment of the present invention, kits are provided which contain the
necessary reagents to carry out one or more assays for detecting the exons/transcripts/genes of
the present invention, such as for gene expression analysis. Specifically, the invention provides a
compartmentalized kit to receive, in close confinement, one or more containers, comprising: (a)
a first container comprising at least one nucleic acid molecule that can bind to a fragment of at
least one of the exon or transcript sequences disclosed herein, including exon-exon spanning
sequences; and (b) one or more other containers comprising wash reagents and/or reagents
capable of detecting presence of a bound nucleic acid. Preferred kits will include detection
reagents/arrays/chips/microfluidic devices that are capable of detecting the expression of 1 or
more, 10 or more, 100 or more, 500 or more, 1000 or more, 10,000 or more, or most or all of the
exons or transcripts identified herein that are expressed in humans. One skilled in the art will
readily recognize that the previously unidentified exons/transcripts provided by the present
invention can be readily incorporated into one of the established kit formats which are well
known in the art, particularly expression arrays.

In detail, a compartmentalized kit includes any kit in which reagents are contained in
separate containers. Such containers include small glass containers, plastic containers, strips of
plastic, glass or paper, or arraying material such as silica. Such containers allow one to
efficiently transfer reagents from one compartment to another compartment such that the
samples and reagents are not cross-contaminated, and the agents or solutions of each container
can be added in a quantitative fashion from one compartment to another. Such kits may typically
include a container which will accept the test sample, a container which contains the nucleic acid
probe, containers which contain wash reagents (such as phosphate buffered saline, Tris-buffers,
etc.), and containers which contain the reagents used to detect the bound probe. The kit can
further comprise reagents for PCR, RT-PCR or other enzymatic reactions, and instructions for
using the kit. Such compartmentalized kits include multicomponent integrated systems.

Multicomponent integrated systems may also implement the transcript/exon sequences,
including exon-exon spanning sequences, provided by the present invention as detection
elements. Multicomponent integrated systems include such systems as microfluidic devices,
biomedical micro-electro-mechanical systems (bioMEMS), and "lab-on-a-chip" systems (see, for
example, US patents 6,153,073, Dubrow et al., and 6,156,181, Parce et al.). Such systems
miniaturize and compartmentalize processes such as probe/target hybridization, PCR, and
capillary electrophoresis reactions in a single functional device, and may be integrated with
nucleic acid arrays. An example of such a technique is disclosed in US patent 5,589,136, which
describes the integration of PCR amplification and capillary electrophoresis in chips.
Multicomponent integrated systems such as microfluidic, bioMEMS, and lab-on-a-chip systems
generally comprise a pattern of microchannels designed onto a glass, silicon, quartz, or plastic
wafer included on a microchip. The movements of the samples are controlled by electric,
electroosmotic, or hydrostatic forces applied across different areas of the microchip to create
functional microscopic valves and pumps with no moving parts. Varying the voltage controls the
liquid flow at intersections between the micro-machined channels and changes the liquid flow
rate for pumping across different sections of the microchip.

Medical- and Pharmaceutical-Related Uses

Detection of gene expression, using the transcripts and/or exons of the present invention,
is valuable for such uses as disease diagnosis, monitoring disease progression, determining the
effects of various treatments/therapeutics, and individualizing medical treatment or drug therapy
based on an individual's gene expression patterns. In particular, uses such as these can be
achieved using the detection reagents provided by the present invention, such as nucleic acid
arrays that utilize the human exon and/or transcript sequences provided by the present invention
as detection elements. Genome-wide expression analysis can be conducted in humans using the
exons/transcripts provided herein; genome-wide expression analysis has previously been
accomplished in yeast (Holstege et al., Cell 95, 717-728 (1998)).

Detection reagents, such as arrays, containing the transcripts/exons of the present
invention can also be used to probe genomic DNA for changes in gene copy number or allelic
imbalances (see Mei et al., Genome Res 2000 Aug;10(8):1126-37, Pollack et al., Nature
Genetics 23, 41-46 (1999), and Pinkel et al., Nature Genetics 20, 207-211 (1998)). Such copy
number changes/allelic imbalances may be caused by gene or chromosome deletions or
duplications, which may occur in cancerous cells and other disorders. Furthermore, identification
of genetic/chromosomal changes such as these may facilitate the identification of specific genes,
regulatory/control regions, or other genetic elements that play important roles in the disorder, or
indicate that a particular chromosomal region harbors such elements.

The sequences and detection reagents of the present invention may be used to determine
whether an individual has a mutation or polymorphism, such as a SNP (single nucleotide
polymorphism), affecting the level (i.e., the concentration of mRNA or protein in a sample, etc.)
or pattern (i.e., the kinetics of expression, rate of decomposition, stability profile, Km, Vmax,
etc.) of gene expression in a particular cell, tissue, bodily fluid, disease state, or developmental
stage. Such variations in gene expression can be caused, for example, by a SNP in a gene, or in a
regulatory/control region(s), such as a promoter, or other gene(s) that controls or affects the
expression of the gene. Such an analysis of gene expression can be conducted by screening for
mRNA corresponding to the exons and/or transcripts provided by the present invention. Once
changes in gene expression patterns are identified, the nucleic acid sequences provided by the
present invention can be used, for example, to design primers/probes for SNP-detection assays to
determine if a SNP is responsible for the variation in gene expression patterns. Such
SNP-detection assays include, but are not limited to, direct sequencing, mini-sequencing primer
extension, and the TaqMan PCR assay, or any other SNP-detection technique known in the art.
Furthermore, SNP-detection assays may utilize nucleic acid arrays, mass spec, or other
technology platforms used in the art for SNP-detection. Once a SNP is detected that alters gene
expression in a manner that contributes to a pathological condition, therapeutic approaches can
be targeted at that SNP and, furthermore, that SNP can serve as a diagnostic/prognostic marker
for the disease, and may form the basis of a diagnostic kit for the disease. Furthermore, SNPs in
the transcript/exon coding sequences provided herein can readily be determined by comparing
the sequences provided herein against corresponding transcript/exon sequences from nucleic acid
isolates taken from different individuals, such as by re-sequencing or computer-based sequence
database comparison. Additionally, changes in the amino acid/protein sequences caused by such
SNPs can readily be determined using the sequences provided herein as a reference and the
universal genetic code.
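The use of a reference coding sequence and the universal genetic code to work out the amino acid change caused by a SNP can be illustrated with a short Python sketch. The function names are hypothetical, the input is assumed to be an in-frame coding sequence whose length is a multiple of three, and positions are assumed to be 0-based; the codon table itself is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate an in-frame DNA coding sequence ('*' marks a stop codon)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def snp_effect(reference_cds, position, alt_base):
    """Return (reference residue, variant residue, class) for one SNP.

    `position` is the 0-based index of the substituted base within the
    coding sequence (an assumption of this sketch).
    """
    reference_cds = reference_cds.upper()
    variant = (reference_cds[:position] + alt_base.upper()
               + reference_cds[position + 1:])
    codon_index = position // 3
    ref_aa = translate(reference_cds)[codon_index]
    alt_aa = translate(variant)[codon_index]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind
```

For the toy sequence ATGGAGCTG (Met-Glu-Leu), substituting T at index 4 changes the second codon from GAG to GTG, so `snp_effect("ATGGAGCTG", 4, "T")` reports a glutamate-to-valine missense change.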

Medical gene expression analysis can include the steps of collecting a sample of cells
from a patient, isolating mRNA from the cells of the sample, contacting the mRNA sample with
one or more probes, based on the exon and/or transcript sequences provided herein, which
specifically hybridize to a region of the isolated mRNA containing a target exon/transcript under
conditions such that hybridization of the probe with the exon/transcript occurs, and detecting the
presence or absence of hybridization. The presence or absence of hybridization, and therefore of
the target exon/transcript, can then be correlated with known gene expression patterns in, for
example, normal cells/tissues and in cells/tissues in various disease stages in order to, for
example, diagnose a disease, determine disease progression, or determine the effect of a
particular drug treatment.

The contribution or association of particular gene expression patterns with disease
phenotypes enables the transcripts/exons of the present invention to be used to develop superior
diagnostic tests based on gene expression/mRNA markers. Such gene expression-based
diagnostic tests are useful for identifying individuals who have a gene expression indicative of a
specific disease or disease propensity or individuals whose gene expression patterns indicate that
a particular drug treatment or therapeutic approach should be utilized. For example, HER2 and
the estrogen receptor genes are known to be expressed at increased levels in cancers, such as
breast and ovarian cancer (van de Vijver et al. (1988) New Engl. J. Med. 319, 1239-1245, Berger
et al. (1988) Cancer Res. 48, 1238-1243, and Petrangeli et al. (1994) J. Steroid Biochem. Mol.
Biol. 49, 327-331) and determining the expression level of these genes may aid physicians in
choosing the most effective treatment (McNeil et al. (1999) J. Natl. Cancer Inst. 91, 110-112,
Leinster et al. (1998) Biochem. Soc. Symp. 63, 185-191, and Revillon et al. (1998) Eur. J. Cancer
34, 791-808). Such diagnostics may be based on a single transcript/gene or exon, a group of
transcripts/genes or exons, or most or all transcripts/genes or exons provided by the present
invention.

The invention further provides a method for identifying a compound that can be used to
treat a disorder associated with expression of a disease-associated gene or variable,
disease-associated, expression of a normal gene. Forms of gene expression such as these are
collectively referred to herein as disease-associated gene expression, and may contribute to, for
example, disease or developmental disorders. The method typically includes assaying the ability
of the compound to modulate the activity and/or expression of the target nucleic acid and thus
identifying a compound that can be used to treat a disorder characterized by undesired activity or
expression of the nucleic acid.

The assays for disease-associated nucleic acid expression can be accomplished using the
transcript and/or exon sequences provided by the present invention as gene expression detection
elements, such as probes in a nucleic acid array. The assay for disease-associated nucleic acid
expression can involve direct assay of nucleic acid levels, such as mRNA levels, or on collateral
compounds involved in the signal pathway. Further, the expression of genes that are up- or
down-regulated in response to the disease-associated protein signal pathway can also be assayed.
In this embodiment the regulatory regions of these genes can be operably linked to a reporter
gene such as luciferase.

Thus, modulators of disease-associated gene expression can be identified in a method
wherein a cell is contacted with a candidate compound, such as a drug or small molecule, and the
expression of mRNA determined. The level of expression of disease-associated mRNA in the
presence of the candidate compound is compared to the level of expression of disease-associated
mRNA in the absence of the candidate compound. The candidate compound can then be
identified as a modulator of nucleic acid expression based on this comparison and be used, for
example, to treat a disorder characterized by disease-associated gene expression. When
expression of mRNA is statistically significantly greater in the presence of the candidate
compound than in its absence, the candidate compound is identified as a stimulator of nucleic
acid expression. When nucleic acid expression is statistically significantly less in the presence of
the candidate compound than in its absence, the candidate compound is identified as an inhibitor
of nucleic acid expression.
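The comparison described above does not prescribe a particular statistical test. As one possible sketch, the following Python code uses an exact two-sided permutation test on the difference of mean expression levels between replicate measurements made with and without the compound. The choice of test, the 0.05 significance threshold, and the function names are all assumptions made for illustration; the exhaustive enumeration is only practical for small numbers of replicates.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(treated, control):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(pooled) - n_treated
    observed = abs(mean(treated) - mean(control))
    total = sum(pooled)
    extreme = splits = 0
    # Enumerate every way of relabelling n_treated measurements as "treated".
    for indices in combinations(range(len(pooled)), n_treated):
        group_sum = sum(pooled[i] for i in indices)
        diff = abs(group_sum / n_treated - (total - group_sum) / n_control)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits


def classify_compound(treated, control, alpha=0.05):
    """Label a compound as 'stimulator', 'inhibitor', or neither."""
    if permutation_p_value(treated, control) >= alpha:
        return "no significant effect"
    return "stimulator" if mean(treated) > mean(control) else "inhibitor"
```

With four replicates per condition there are 70 possible relabellings, so the smallest attainable two-sided p-value is 2/70 (about 0.029); fewer replicates could never reach the 0.05 threshold under this test.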
The invention further provides methods of treatment, with one or more of the
genes/transcripts/exons provided by the present invention as a target, using a compound
identified through drug screening using the transcript/exon sequences provided herein, as a gene
modulator to modulate nucleic acid expression. Modulation includes both up-regulation (i.e.
activation or agonization) or down-regulation (suppression or antagonization) of nucleic acid
expression. These methods of treatment include the step of administering the modulators of gene
expression in a pharmaceutical composition to a subject in need of such treatment.

The exon/transcript sequences provided herein are also useful for monitoring the
effectiveness of modulating compounds on the expression or activity of a gene in clinical trials
or in a treatment regimen. Thus, the gene expression pattern can serve as a barometer for the
continuing effectiveness of treatment with the compound, particularly with compounds to which
a patient can develop resistance. The gene expression pattern can also serve as a marker
indicative of a physiological response of the affected cells to the compound. Accordingly, such
monitoring would allow either increased administration of the compound or the administration of
alternative compounds to which the patient has not become resistant. Similarly, if the level of
nucleic acid expression falls below a desirable level, administration of the compound could be
commensurately decreased. Therefore, the transcript/exon sequences of the present invention are
particularly useful for improving the process of drug development by allowing changes in gene
expression patterns in response to candidate compounds/drugs to be determined; such changes in
gene expression patterns can be analyzed to determine compound/drug efficacy and/or toxicity.
This not only improves the safety of clinical trials, but also will enhance the chances that the trial
will demonstrate statistically significant efficacy by allowing the clinical trials to be adjusted in
response to different gene expression patterns observed in different patients in response to a
candidate compound/drug. Furthermore, gene expression analysis using the transcripts/exons of
the present invention may help explain why certain, previously developed drugs performed
poorly in clinical trials and may help identify a subset of the population that would benefit from
a drug that had previously performed poorly in clinical trials, thereby "rescuing" previously
developed drugs.

Gene expression analysis using the detection reagents of the present invention is also
useful for determining the target of a drug. For example, gene expression patterns in cells treated
with a drug can be compared to gene expression patterns in cells that have had individual genes,
particularly genes corresponding to the exons/transcripts provided herein, inactivated. A similar
gene expression pattern would indicate that the drug may target the gene that had been
inactivated.
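One simple way to quantify the "similar gene expression pattern" mentioned above is the Pearson correlation between the expression changes seen after drug treatment and those seen after inactivating each candidate gene. The sketch below makes that assumption explicit; the similarity measure, the function names, and the representation of a profile as a list of per-gene changes in a fixed gene order are all hypothetical choices, not a method stated in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / (sd_x * sd_y)


def rank_candidate_targets(drug_profile, knockout_profiles):
    """Rank inactivated genes by similarity of their profile to the drug's.

    `knockout_profiles` maps a gene name to the expression-change profile
    measured in cells with that gene inactivated; every profile must list
    the same genes in the same order as `drug_profile`.
    """
    scores = {gene: pearson(drug_profile, profile)
              for gene, profile in knockout_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The gene at the top of the returned list is the inactivation whose expression signature most closely resembles the drug's, and is therefore the most plausible candidate target under this simplified measure.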
Gene expression analysis, using the transcripts/exons provided by the present invention,
may also be useful in forensic and medicolegal investigations. For example, post-mortem gene
expression analysis may provide clues as to cause of death or time of death, may indicate
exposure to toxic compounds or drugs, and may aid in identification.

Examples of other important uses of the transcripts/exons provided herein for gene
expression include, but are not limited to, determining the toxicological consequences of altered
gene expression (Pennie, Toxicol Lett 2000 Mar 15;112-113:473-7), understanding changes in
gene expression in response to infection (Manger et al., Curr Opin Immunol 2000
Apr;12(2):215-8) and regulating the expression [...] Jan;7(2):120-5).
Expression Modulating Fragments

The present invention is useful for unraveling and characterizing the complex genetic
network involved in the regulation and control of gene expression. For example, the present
invention facilitates the identification and characterization of regulatory/control elements in the
[...] "expression modulating fragments" (EMFs), or expression [...]. [...] "expression
modulating fragment," means a series of nucleotide molecules that [...] modulate the expression
of an operably linked [...]. [...] may also include gene products such as transcriptional [...].
[...] regulated genes, referred to as "regulons", can also be identified. Genomic features such as
novel EMFs and regulons can be identified, for example, through genome-wide expression
analysis using arrays comprising the exons/transcripts provided by the present invention.
Genomic sequence motifs that are statistically over-abundant in regions close to similarly
expressed genes, particularly in 5' regions, may be identified as novel EMFs, such as
cis-regulatory elements. Furthermore, using genome-wide expression analysis, one can
determine whether an EMF has a global effect (affects a large number of genes, or all genes) or a
specific effect (affects a small number of genes, or a single gene) (Holstege et al., Cell 95,
717-728 (1998)). Additionally, by providing a tool for monitoring gene/transcript expression, the
present invention is also useful for monitoring variations in gene/transcript expression in [...]
mutations or polymorphisms in EMFs based on variations in gene expression. Such
polymorphisms in EMFs, particularly SNPs, may be useful diagnostic markers for disease.
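The search for sequence motifs that are over-abundant in the 5' regions of similarly expressed genes can be sketched as a k-mer comparison between those regions and a background set. The code below is a deliberately crude illustration: it reports a smoothed enrichment ratio rather than a formal significance test (a hypergeometric or similar test would be a natural refinement), and the function names, k-mer length, and ratio threshold are assumptions.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """For each k-mer, count how many of the sequences contain it."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_motifs(coexpressed_upstream, background_upstream,
                    k=6, min_ratio=2.0):
    """Return (k-mer, ratio) pairs over-represented in the first set.

    The ratio compares the fraction of co-expressed upstream regions
    containing the k-mer with an add-one-smoothed background fraction.
    """
    foreground = kmer_presence(coexpressed_upstream, k)
    background = kmer_presence(background_upstream, k)
    n_fg, n_bg = len(coexpressed_upstream), len(background_upstream)
    results = []
    for kmer, hits in foreground.items():
        fg_freq = hits / n_fg
        bg_freq = (background[kmer] + 1) / (n_bg + 1)  # avoids division by 0
        ratio = fg_freq / bg_freq
        if ratio >= min_ratio:
            results.append((kmer, ratio))
    # Highest ratio first; ties broken alphabetically for reproducibility.
    return sorted(results, key=lambda item: (-item[1], item[0]))
```

A k-mer present upstream of every co-expressed gene but absent from the background set receives the highest ratio and would be a candidate EMF to examine further.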
As used herein, a sequence is said to "modulate the expression of an operably linked
sequence" when the expression of the sequence is altered by the presence of the EMF. EMFs
include, but are not limited to, promoters and promoter modulating sequences (inducible
elements). One class of EMFs is comprised of fragments that induce the expression of an
operably linked gene/transcript in response to a specific regulatory factor or physiological event.

EMF sequences can be identified within the human genome by their proximity to the
transcripts/exons provided by the present invention. [...] intergenic segment from about 10 to
200, 10 to 500, 10 to 1 kb, or 10 to 2.5 kb nucleotides in length, preferably taken 5' from any
one of the transcripts identified in the Sequence Listing (file SEQLIST.TXT), provided on the
accompanying CD labeled CL001101CDA, will modulate the [...] operably linked 3'
gene/transcript in a fashion similar to that found with [...] gene/transcript sequence. As used
herein, an "intergenic segment" refers to fragments of the human genome that are between two
transcripts herein described. Alternatively, EMFs can be identified using known EMFs as a
target sequence or target motif in the computer-based systems of the present invention.

The presence and activity of an EMF can be confirmed using an EMF trap vector. An
EMF trap vector contains a cloning site 5' to a marker sequence. A marker sequence encodes an
identifiable phenotype, such as antibiotic resistance or a complementing nutrition auxotrophic
factor, which can be identified or assayed when the EMF trap vector is placed within an
appropriate host under appropriate conditions. As described above, an EMF will modulate the
expression of an operably linked marker sequence. A sequence which is suspected as being an
EMF is cloned in all three reading frames in one or more restriction sites upstream from the
marker sequence in the EMF trap vector. The vector is then transformed into an appropriate host
using known procedures and the phenotype of the transformed host is examined under
appropriate conditions. As described above, an EMF will modulate the expression of an
operably linked marker sequence.



Computer Related Embodiments

The nucleotide sequences provided by the present invention, a representative fragment
thereof, or nucleotide sequences at least 99% identical to these sequences, may be "provided" in
a variety of mediums to facilitate use thereof. As used herein, "provided" refers to a
manufacture, other than an isolated nucleic acid molecule, that contains a nucleotide sequence of
the present invention, i.e., the nucleotide sequences provided in the present invention, a
representative fragment thereof, or nucleotide sequences at least 99% identical to these
sequences. Such a manufacture provides the coding portion of the human genome or a subset
thereof (e.g., a human exon or transcript sequence) in a form that allows a skilled artisan to
examine the manufacture using means not directly applicable to examining the human genome or
a subset thereof as it exists in nature or in purified form.



[...] any medium that can be read and accessed directly by a computer. Such media include, but
are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and
magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM
and ROM; and hybrids of these categories such as magnetic/optical storage media. A skilled
artisan can readily appreciate how any of the presently known computer readable mediums can
[...] (provided in file SEQLIST.TXT on the accompanying CD labeled CL001101CDA).

As used herein, "recorded" refers to a process for storing information on computer 
readable medium. A skilled artisan can readily adopt any of the presently known methods for 



invention. The choice of the data storage structure will generally be based on the means chosen 
to access the stored information. In addition, a variety of data processor programs and formats 
can be used to store the nucleotide sequence information of the present invention on computer 
readable medium. The sequence information can be represented in a word processing text file, 
formatted in commercially-available software such as WordPerfect and Microsoft Word, or 
represented in the form of an ASCII file, stored in a database application, such as DB2, Sybase, 
Oracle, or the like. A skilled artisan can readily adapt any number of data processor structuring 
formats (e.g., text file or database) in order to obtain computer readable medium having recorded 
thereon the nucleotide sequence information of the present invention. 
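
By way of illustration only, the two structuring formats named above (a text file and a database) 
might be sketched as follows. The sketch assumes a FASTA-style ASCII layout for the text file and 
uses SQLite as a stand-in for a database application such as DB2, Sybase, or Oracle; the record 
names, sequences, table name, and function names are hypothetical and are not taken from the 
specification. 

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical records: names and sequences are invented for illustration.
RECORDS = {
    "transcript_1": "ATGGCCAAGTTGACC",
    "transcript_2": "ATGCCCGGGTTTAAATAG",
}


def write_fasta(records, path):
    """Record sequences as a plain ASCII text file in a FASTA-style layout."""
    with open(path, "w", encoding="ascii") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")


def read_fasta(path):
    """Read the text file back into a dict mapping name -> sequence."""
    records, name = {}, None
    with open(path, encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                records[name] = ""
            elif name is not None:
                records[name] += line
    return records


def write_database(records, connection):
    """Record sequences in a database table (SQLite used as a stand-in)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(name TEXT PRIMARY KEY, seq TEXT NOT NULL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO sequences (name, seq) VALUES (?, ?)",
        list(records.items()),
    )
    connection.commit()


def read_database(connection):
    """Read every stored sequence back into a dict mapping name -> sequence."""
    return dict(connection.execute("SELECT name, seq FROM sequences"))


def roundtrip_text(records):
    """Write records to a temporary text file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sequences.fasta"
        write_fasta(records, path)
        return read_fasta(path)


def roundtrip_database(records):
    """Write records to an in-memory database and read them back."""
    connection = sqlite3.connect(":memory:")
    try:
        write_database(records, connection)
        return read_database(connection)
    finally:
        connection.close()
```

Either format returns the same sequence information; which one is preferable depends on the means 
chosen to access the stored information, as stated above. 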



thereof or nucleotide sequences at least 99% identical to these sequences, in computer readable 
form, a skilled artisan can routinely access the sequence information for a variety of purposes. 
Computer software is publicly available which allows a skilled artisan to access sequence 
information provided in a computer readable medium. Software which implements the BLAST 
(Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE (Brutlag et al., Comp. Chem. 
17:203-207 (1993)) search algorithms on a Sybase system may be used to identify 
exons/transcripts within the human genome which contain homology to nucleic acid or protein 
sequences from other organisms. Such exons/transcripts are protein-encoding fragments within 
the human genome and are useful in producing commercially important proteins such as 
therapeutic proteins. 

The present invention further provides systems, particularly computer-based systems, 
which contain the sequence information described herein. Such systems are designed to identify 
commercially important fragments of the human genome. 

As used herein, a "computer-based system" refers to the hardware means, software 
means, and data storage means used to analyze the nucleotide sequence information of the 



computer-based systems are suitable for use in the present invention. Such a system can be 
changed into a system of the present invention by utilizing the sequence information provided on 
the CD-R, or a subset thereof, without any experimentation. 

As stated above, the computer-based systems of the present invention comprise a data 
storage means having stored therein a nucleotide sequence of the present invention and the 
necessary hardware means and software means for starting and implementing a search means. 

As used herein, "data storage means" refers to memory which can store nucleotide sequence 
information of the present invention, or a memory access means which can access manufactures 
having recorded thereon the nucleotide sequence information of the present invention. 



sequence information stored within the data storage means. Search means are used to identify 
fragments or regions of the human genome that match a particular target sequence or target 
motif. A variety of known algorithms are disclosed publicly and a variety of commercially 
available software for conducting search means are available and can be used in the 
computer-based systems of the present invention. Examples of such software include, but are not 
limited to, MacPattern (EMBL), BLASTN and BLASTX (NCBIA). A skilled artisan can readily 
recognize that any one of the available algorithms or implementing software packages for 
conducting homology searches can be adapted for use in the present computer-based systems. 

As used herein, a "target sequence" can be any DNA or amino acid sequence of six or 
more nucleotides or two or more amino acids. A skilled artisan can readily recognize that the 
longer a target sequence is, the less likely a target sequence will be present as a random 
occurrence in the database. The most preferred sequence length of a target sequence is from 
about 10 to 100 amino acids or from about 20 to 300 nucleotide residues. However, it is well 
recognized that searches for commercially important fragments of the human genome, such as 
sequence fragments involved in gene expression and protein processing, may be of shorter 
length. 
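
The relationship between target length and random occurrence can be illustrated with a short 
calculation. The following is a minimal sketch under a simplifying assumption that is not part of 
the specification: residues in the database are treated as independent and equally likely, and 
only exact matches on a single strand are counted. The function name and parameters are 
hypothetical. 

```python
def expected_random_hits(target_length, database_length, alphabet_size=4):
    """Expected number of exact chance matches of one target in a database.

    Assumes independent, uniformly distributed residues (an idealization).
    Use alphabet_size=4 for nucleotides or alphabet_size=20 for amino acids.
    """
    if target_length < 1 or database_length < target_length:
        return 0.0
    # Number of positions at which the target could start.
    positions = database_length - target_length + 1
    # Probability that one given position matches every residue by chance.
    per_position = (1.0 / alphabet_size) ** target_length
    return positions * per_position


# Compare a six-nucleotide target with a twenty-nucleotide target in a
# database of three billion residues (an illustrative size only).
six_mer = expected_random_hits(6, 3_000_000_000)
twenty_mer = expected_random_hits(20, 3_000_000_000)
```

Under these assumptions the six-nucleotide target is expected to occur by chance roughly 
730,000 times, whereas the twenty-nucleotide target is expected to occur about 0.003 times, 
which is consistent with the preference for longer target sequences stated above. 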

As used herein, "a target structural motif," or "target motif," refers to any rationally 
selected sequence or combination of sequences in which the sequence(s) is chosen based on a 
three-dimensional configuration which is formed upon the folding of the target motif. There are 
a variety of target motifs known in the art. Protein target motifs include, but are not limited to, 
enzymatic active sites and signal sequences. Nucleic acid target motifs include, but are not 
limited to, promoter sequences, hairpin structures and inducible expression elements (protein 
binding sequences). 
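
As a hedged illustration of searching for a nucleic acid target motif, the sketch below scans a 
sequence for a short promoter-like pattern expressed as a regular expression. The pattern (a 
TATA-box-like consensus, TATAWAW, where W is A or T) and the function name are assumptions chosen 
for illustration; the sketch operates on primary sequence only and does not model the 
three-dimensional configuration referred to above. 

```python
import re

# Illustrative pattern only: a TATA-box-like consensus (W = A or T).
# The lookahead allows overlapping occurrences to be reported.
TATA_LIKE = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_motif(sequence, pattern=TATA_LIKE):
    """Return (start, matched_text) for every occurrence of the pattern.

    The pattern is expected to wrap its motif in a lookahead with one
    capturing group, as TATA_LIKE does.
    """
    return [
        (match.start(), match.group(1))
        for match in pattern.finditer(sequence.upper())
    ]


# Hypothetical input sequence, invented for illustration.
hits = find_motif("ggctataaaaggc")
```

Here `hits` holds one entry giving the zero-based start position of the match and the matched 
text; a different motif can be searched by supplying another compiled pattern. 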


A variety of structural formats for the input and output means can be used to input and 
output the information in the computer-based systems of the present invention. A preferred 
format for an output means ranks fragments of the human genome possessing varying degrees of 
homology to the target sequence or target motif. Such presentation provides a skilled artisan 
with a ranking of sequences which contain various amounts of the target sequence or target motif 
and identifies the degree of homology contained in the identified fragment. 

A variety of comparing means can be used to compare a target sequence or target motif 
with the data storage means to identify sequence fragments of the human genome. For example, 
software which implements the BLAST and BLAZE algorithms (Altschul et al., J. Mol. Biol. 
215:403-410 (1990)) can be used to identify sequence fragments of interest within the human 
genome. A skilled artisan can readily recognize that any one of the publicly available homology 
search programs can be used as the search means for the computer-based systems of the present 
invention. 
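
The comparing means and the ranked output format described above can be sketched in a few lines. 
The following is not BLAST or BLAZE; it is a deliberately simple ungapped sliding-window 
comparison, offered only to show how stored fragments might be scored against a target sequence 
and ranked by degree of identity. The fragment names, sequences, and function names are 
hypothetical. 

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_window(target, fragment):
    """Best ungapped placement of target in fragment: (identity, start)."""
    k = len(target)
    best = (0.0, -1)  # start stays -1 when no window matches any residue
    for start in range(len(fragment) - k + 1):
        score = percent_identity(target, fragment[start:start + k])
        if score > best[0]:
            best = (score, start)
    return best


def rank_fragments(target, fragments):
    """Rank stored fragments by their best identity to the target."""
    if not target:
        raise ValueError("target sequence must not be empty")
    hits = []
    for name, seq in fragments.items():
        identity, start = best_window(target.upper(), seq.upper())
        hits.append((name, identity, start))
    # Highest degree of identity first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical stored fragments, invented for illustration.
FRAGMENTS = {
    "frag_c": "CCCCCCCCCCCC",
    "frag_b": "GGGACGTTCGTGGG",
    "frag_a": "GGGACGTACGTGGG",
}
ranked = rank_fragments("ACGTACGT", FRAGMENTS)
```

Each entry of `ranked` gives a fragment name, its percent identity to the target, and the 
zero-based position of the best window, so that the fragment containing the target exactly is 
listed first. A production search means would instead use a gapped, statistically scored 
algorithm such as those cited above. 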

One application of this embodiment is provided in the figure. The figure provides a 
block diagram of a computer system 102 that can be used to implement the present invention. 
The computer system 102 includes a processor 106 connected to a bus 104. Also connected to 
the bus 104 are a main memory 108 (preferably implemented as random access memory, RAM) 
and a variety of secondary storage devices 110, such as a hard drive 112 and a removable 
medium storage device 114. The removable medium storage device 114 may represent, for 
example, a floppy disk drive, a CD-ROM drive, a magnetic tape drive, etc. A removable storage 
medium 116 (such as a floppy disk, a compact disk, a magnetic tape, etc.) containing control 
logic and/or data recorded therein may be inserted into the removable medium storage device 
114. The computer system 102 includes appropriate software for reading the control logic and/or 
the data from the removable storage medium 116 once inserted in the removable medium storage 
device 114. 

The nucleotide sequences of the present invention may be stored in a well-known manner 
in the main memory 108, any of the secondary storage devices 110, and/or a removable storage 
medium 116. Software for accessing and processing the nucleotide sequence (such as search 
tools, comparing tools, etc.) resides in main memory 108 during execution. 

All publications and patents mentioned in the above specification are herein incorporated 
by reference. Various modifications and variations of the described methods and systems of the 
invention will be apparent to those skilled in the art without departing from the scope 
of the invention. Although the invention has been described in connection with specific 
preferred embodiments, it should be understood that the invention as claimed should not be 
unduly limited to such specific embodiments. Indeed, various modifications of the 
above-described modes for carrying out the invention which are obvious to those skilled in the 
field of molecular biology or related fields are intended to be within the scope of the following 
claims. 
Claims 

That which is claimed is: 

1) An isolated nucleic acid detection reagent that is capable of detecting the presence 
of 100,000 or more human exons, wherein said exons are selected from the group 
consisting of those identified in Table 1. 

2) The detection reagent of claim 1, wherein said reagent is a nucleic acid array. 

3) The array of claim 2, wherein said array is comprised of short oligonucleotides 
from about 5 to about 100 nucleotides in length. 

4) The array of claim 2, wherein said array is comprised of polynucleotides based on 
the transcript sequences (SEQ ID NOS:1-39010), wherein said polynucleotides 
are from about 100 to about 1000 nucleotides in length. 

5) An isolated nucleic acid detection reagent that is capable of detecting the presence 
of 2000 or more human exons, wherein said exons are selected from the group 
consisting of those identified in Table 1. 

6) The detection reagent of claim 5, wherein said reagent is a nucleic acid array. 

7) The array of claim 6, wherein said array is comprised of short oligonucleotides 
from about 5 to about 100 nucleotides in length. 

8) The array of claim 6, wherein said array is comprised of polynucleotides based on 
the transcript sequences (SEQ ID NOS:1-39010), wherein said polynucleotides 
are from about 100 to about 1000 nucleotides in length. 

9) An isolated nucleic acid detection reagent that is capable of detecting the presence 
of 5000 or more human exons, wherein said exons are selected from the group 
consisting of those identified in Table 1. 

10) The detection reagent of claim 9, wherein said reagent is a nucleic acid array. 

11) The array of claim 10, wherein said array is comprised of short oligonucleotides 
from about 5 to about 100 nucleotides in length. 

12) The array of claim 10, wherein said array is comprised of polynucleotides based 
on the transcript sequences (SEQ ID NOS:1-39010), wherein said polynucleotides 
are from about 100 to about 1000 nucleotides in length. 

13) An isolated nucleic acid detection reagent that is capable of detecting the presence 
of 10,000 or more human exons, wherein said exons are selected from the group 
consisting of those identified in Table 1. 

14) The detection reagent of claim 13, wherein said reagent is a nucleic acid array. 

15) The array of claim 14, wherein said array is comprised of short oligonucleotides 
from about 5 to about 100 nucleotides in length. 

16) The array of claim 14, wherein said array is comprised of polynucleotides based 
on the transcript sequences (SEQ ID NOS:1-39010), wherein said polynucleotides 
are from about 100 to about 1000 nucleotides in length. 

17) The detection reagent of claim 1, wherein said reagent is comprised of at least one 
polynucleotide spanning at least one exon-exon junction identified in Table 1. 

18) The detection reagent of claim 5, wherein said reagent is comprised of at least one 
polynucleotide spanning at least one exon-exon junction identified in Table 1. 

19) The detection reagent of claim 9, wherein said reagent is comprised of at least one 
polynucleotide spanning at least one exon-exon junction identified in Table 1. 

20) The detection reagent of claim 13, wherein said reagent is comprised of at least 
one polynucleotide spanning at least one exon-exon junction identified in Table 1. 







FIGURE 1 



